ABSTRACT

Aside from the theoretical issues involving the validity of inferences
from surveys, the basic problem of producing unbiased estimates of
regression parameters and estimates of the associated standard errors
has been a particularly difficult issue in dealing with results from
stratified multistage sample designs such as the one used in the
National Longitudinal Study of the High School Class of 1972 (NLS).
The purpose of this report is to review some appropriate available
techniques that may be useful in applying regression models to the NLS
data. The first section provides a framework for evaluation and an
appraisal of some alternate approaches within this framework. A
preferred approach (combining the Horvitz-Thompson estimator and
Taylorized deviation) is compared to an Ordinary Least Squares
approach through a simulation procedure using actual NLS data. The
several results are summarized. Formulae underlying the preferred
approach are provided separately in Appendixes A and B, and details of
the development and use of a computer program to implement the
approach are provided in Appendixes C and D. (Author/BW)
I. INTRODUCTION

The National Longitudinal Study (NLS) of the High School Class of 1972 is
a large-scale sample survey sponsored by the National Center for Education
Statistics (NCES). The sample design for this survey can be described as a
deeply stratified two-stage design with 600 final strata. The original design
called for 1,200 schools and 18 students per school (size permitting). A
total of 1,069 schools and 16,683 students participated in the base-year
survey, which was conducted by Educational Testing Service. An additional
follow-up of nonrespondent schools, plus additional backup schools and
augmentation of the sample for the first follow-up, increased the number of
participating schools to 1,318 and the total student sample to 23,451. The
numbers of respondents to the first, second, third, and fourth follow-up
questionnaires, administered by the Research Triangle Institute (RTI), were
21,350, 20,873, 20,092, and 18,630, respectively.

As suggested above, a large amount of data has been collected for this
study. The types of statistics required to address various research questions
of interest range from simple descriptive totals and means to more complex
analytic statistics, such as regression coefficients, but the problems of
drawing valid and relevant inferences, which are common to all multistage
sample surveys, must be addressed in analyzing the NLS data. For complex
statistics, such as regression coefficients, there are no "pat" solutions to
these problems; however, the need still exists for some good, even though
imperfect, techniques to approximate these statistics and their errors.
Aside from the various theoretical issues involving the validity of
inferences from surveys, the basic problem of producing unbiased estimates of
regression parameters and estimates of the associated standard errors has been
a particularly thorny issue in dealing with results from stratified multistage
sample designs such as the one used in the NLS. Most of the available
statistical software packages [such as SPSS (Nie, et al., 1975), SAS (Barr,
et al., 1976, 1977), BMDP (Dixon, 1975), or OSIRIS (Rattenbury and Eck, 1973;
Institute for Social Research, 1973)] treat the sample as independent random
observations, ignoring the sample design. This approach is convenient but
theoretically inappropriate, since it does not account for unequal
probabilities of selection or for effects of stratification and/or clustering.
The application of sampling weights is possible through some software
packages, allowing correct estimates of regression coefficients, but
appropriate error variance estimates typically are not produced. In fact, it
is not possible to obtain explicit expressions for variance estimates of
complex estimators such as regression coefficients within complex survey
sample designs; however, various approximation procedures are available.

The purpose of this report is to review some appropriate available
techniques that may be useful in applying regression models to the NLS data.
The following section provides a framework for evaluation and an appraisal of
some alternate approaches within this framework. In Section III, a preferred
approach (combining the Horvitz-Thompson estimator and Taylorized deviation)
is compared to an Ordinary Least Squares approach through a simulation
procedure using actual NLS data. The several results are summarized in
Section IV. Formulae underlying the preferred approach are provided separately
in Appendixes A and B, and details of the development and use of a computer
program to implement the approach are provided in Appendixes C and D.
II. ASSESSMENT OF ALTERNATIVE TECHNIQUES

Survey research in the social sciences is often based on large complex
samples, from which inferences are made regarding the population under study.
The most common practice for drawing inferences about a univariate parameter
is to assume $(\hat\theta - \theta)/s(\hat\theta)$ has approximately the
Gaussian or Student's t distribution, where the statistic $\hat\theta$ is an
estimate of the parameter $\theta$ and $s(\hat\theta)$ is an estimate of the
standard error of $\hat\theta$. Similarly, for a multivariate parameter
$\theta$, represented as a row vector, inference may be based on a Hotelling's
$T^2$ type statistic of the form
$(\hat\theta - \theta)\{V(\hat\theta)\}^{-1}(\hat\theta - \theta)'$, which is
assumed to have a chi-square or transformed F distribution in repeated
samples, where $V(\hat\theta)$ is an estimate of the variance-covariance
matrix of $\hat\theta$.

The justification for such an approach is based on the assumption that a
generalized central limit theorem applies to large complex probability samples
from finite populations. David (1938), Madow (1945), and Hajek (1960) have
established such results for the mean of a simple random sample by letting the
population size increase at the same rate as the sample size. For survey
statisticians concerned with finite population inference, the regularity with
which sampling distributions for properly standardized survey statistics can
be expected to follow classical distributions continues to be one of the most
important unanswered questions.

The problem is further complicated in the case of regression models in
the specification of $\theta$ and the standard error of $\hat\theta$, where
$\theta$ is a vector of regression coefficients. A variety of models and
interpretations have been suggested [e.g., Konijn (1962), Godambe and
Thompson (1971), Royall (1971), Kish and Frankel (1974), Fuller (1974), and
Folsom (1974)].



A. Overview and Notation¹

For the NLS survey, schools were stratified by several characteristics to
obtain 600 strata (Westat, 1972). Within each stratum, h, $m_h$ (mostly 2)
schools were selected at random from the total of $M_h$ schools in the
stratum. Within each school (i.e., the ith school in the hth stratum), a
random sample of $n_{hi}$ (mostly 18) students from the total of $N_{hi}$
students within school hi were selected for survey.

¹ Although the discussion in this and subsequent sections is sometimes
specific to the NLS survey design, the results are generally applicable.



Within this context, estimates are required; for example, consider an
estimate $\hat\theta$ of the national total number of high school seniors who
were in an academic curriculum. Let $\hat{X}_{hi}$ be the estimated number of
students with academic curriculum in the (hi)th school, $h = 1, 2, \ldots, H$
and $i = 1, 2, \ldots, m_h$.

If the sample of schools within each stratum is selected with equal
probabilities and with replacement, then the estimated total, $\hat{T}_1$, and
an unbiased estimate of its variance, $V(\hat{T}_1)$, are

$$\hat{T}_1 = \sum_{h=1}^{H} \frac{M_h}{m_h} \sum_{i=1}^{m_h} \hat{X}_{hi} = \sum_{h=1}^{H} \sum_{i=1}^{m_h} \hat{Y}_{hi} , \tag{1.1}$$

and

$$V(\hat{T}_1) = \sum_{h=1}^{H} m_h \sum_{i=1}^{m_h} (\hat{Y}_{hi} - \bar{Y}_h)^2 / (m_h - 1) , \tag{1.2}$$

where $\hat{Y}_{hi} = M_h \hat{X}_{hi} / m_h$

and $\bar{Y}_h = \sum_{i=1}^{m_h} \hat{Y}_{hi} / m_h$.
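The estimator in (1.1) and its variance estimate in (1.2) can be sketched in a
few lines of Python. This is only an illustrative sketch: the function name
`equal_probability_total`, the input layout, and the example numbers are
assumptions made here and are not part of the NLS data or of any program
described in this report.

```python
def equal_probability_total(strata):
    """Estimate a total and its variance in the manner of (1.1) and (1.2).

    `strata` is a list of (M_h, [X_h1, X_h2, ...]) pairs: the number of
    schools in the stratum and the per-school estimates for the sampled
    schools (hypothetical input layout).
    """
    total = 0.0
    variance = 0.0
    for schools_in_stratum, school_values in strata:
        m = len(school_values)
        # Y_hi = M_h * X_hi / m_h: each sampled school's share of the total.
        y = [schools_in_stratum * x / m for x in school_values]
        y_bar = sum(y) / m
        total += sum(y)
        # Between-school variation within the stratum, scaled as in (1.2).
        variance += m * sum((v - y_bar) ** 2 for v in y) / (m - 1)
    return total, variance


# Hypothetical data: two strata with two sampled schools each.
example_strata = [(10, [4.0, 6.0]), (5, [2.0, 2.0])]
total_1, variance_1 = equal_probability_total(example_strata)
```

Only the variation among sampled schools within a stratum enters the variance
estimate, so a stratum whose sampled schools agree exactly contributes nothing.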

If an approximate estimate of the size of schools ($s_{hi}$ = size = total
number of students in the (hi)th school) were known, then one could use a
biased ratio estimator, $\hat{T}_2$, given by

$$\hat{T}_2 = \sum_{h=1}^{H} \sum_{i=1}^{m_h} \hat{Z}_{hi} , \tag{1.3}$$

where $\hat{Z}_{hi} = s_{h+} \hat{X}_{hi} / (m_h s_{hi})$

and $s_{h+} = \sum_{i=1}^{M_h} s_{hi}$.

The estimator $\hat{T}_2$, which may be recognized as a Horvitz-Thompson (1952)
estimator, is an unbiased estimator of the total, if the probability of
selecting the (hi)th school is $s_{hi}/s_{h+}$ on each of $m_h$ draws. An
unbiased estimate of its variance is given as

$$V(\hat{T}_2) = \sum_{h=1}^{H} m_h \sum_{i=1}^{m_h} (\hat{Z}_{hi} - \bar{Z}_h)^2 / (m_h - 1) , \tag{1.4}$$

where $\bar{Z}_h = \sum_{i=1}^{m_h} \hat{Z}_{hi} / m_h$.

This appears to be the most common approach in many sample surveys. The
absence of bias in the estimator $\hat{T}_2$ and the estimate of its variance
$V(\hat{T}_2)$ are established over repeated samples with the primary sampling
units selected with unequal probabilities and with replacement.
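The probability-proportional-to-size estimator in (1.3) and the variance
estimate in (1.4) differ from the equal-probability case only in how each
school's value is scaled. The following is a minimal sketch under the
with-replacement assumption stated above; the function name `pps_total`, the
input layout, and the numbers are hypothetical.

```python
def pps_total(strata):
    """Estimate a total and its variance in the manner of (1.3) and (1.4).

    `strata` is a list of (s_h_plus, [(X_hi, s_hi), ...]) pairs: the total
    size measure of the stratum and, for each sampled school, its estimate
    and its own size measure (hypothetical input layout).
    """
    total = 0.0
    variance = 0.0
    for stratum_size, schools in strata:
        m = len(schools)
        # Z_hi = s_h+ * X_hi / (m_h * s_hi): value divided by draw probability.
        z = [stratum_size * x / (m * size) for x, size in schools]
        z_bar = sum(z) / m
        total += sum(z)
        variance += m * sum((v - z_bar) ** 2 for v in z) / (m - 1)
    return total, variance


# Hypothetical stratum: total size 1000, two sampled schools of size 100 and 300.
example_pps = [(1000.0, [(10.0, 100.0), (36.0, 300.0)])]
total_2, variance_2 = pps_total(example_pps)
```

When the characteristic is nearly proportional to school size, the scaled
values are nearly equal and the estimated variance is small, which is the
situation in which this estimator performs well.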

While the prior discussion has been directed to providing an example of
estimation within the NLS study, toward introducing notation, it also has
illustrated the way in which the sample design or the conceptualization of all
possible samples affects the variance of the estimator based on only one of
the samples. The freedom of survey designers to define the sampling
distributions has raised several fundamental issues regarding various
statistical estimates. A full discussion of these issues is not within the
scope of this report; however, the practical problem of selecting an
appropriate estimator and an estimate of the variance of that estimator must
still be addressed.

The estimator chosen for the current purpose is the Horvitz-Thompson
estimator. For sampling with unequal probabilities, this estimator is used
widely in practice and has been found to be an admissible estimate. The
Horvitz-Thompson estimator is not the "best" estimator in all cases, but the
same can be said of any other estimator. When probabilities of selection are
based on a priori information about size and the relationship of size to the
characteristic of interest, the Horvitz-Thompson estimator is optimal or
nearly optimal.

The choice of this estimator is not as arbitrary as it may appear (Shah,
1980); however, there are few practical rules to support the choice. The most
common advice for selecting an estimator is to examine the data before
deciding which estimator is optimal. An expert in survey design and theory may
be able to reach such a decision because of past experience and knowledge.
Other researchers may need a catalog of alternate estimators and a set of
rules that will enable them to select the optimal estimator. At present, no
such guidelines exist except for such vague statements as, "If probabilities
of selection have no relation to the characteristic to be measured, then the
simple mean would be better than the Horvitz-Thompson estimator." The survey
practitioner obviously needs better guidelines for choosing estimators, but
until such time as these rules become available, the survey practitioner
probably will continue to use the Horvitz-Thompson estimator, which is optimal
in most cases, even though it may be inefficient in a few situations.

A second important consideration, which is often neglected, is an estimate
of the variance of the estimator. The estimator that one uses may or may not
be optimal and may or may not be efficient, but it is imperative that some
proper estimate of the variance (mean square error) of the estimator be
computed from the data. The proper evaluation of mean square error assumes an
additional dimension of importance when the estimator used is not unbiased.
Guidelines for selecting from among available mean square error estimators
also are not readily available.

In the case of estimates of error variance, there are additional
considerations. For the example given above, the total is a simple linear
function of the observations, and it is possible to derive explicit algebraic
expressions for estimating variances of such linear functions. However, it is
not possible to obtain such explicit expressions for variance estimates of
complex estimators such as a regression coefficient or a correlation
coefficient.² There are, however, various available approximation procedures;
some such procedures are: (1) Taylorized deviations, (2) independent
replications, (3) balanced repeated replications, and (4) Jackknife.

² It should be noted that this difficulty with complex statistics is common to
all branches of statistics and is not a distinctive feature of sample surveys.

B. Application of the Central Limit Theorem

Assuming the estimate and variance for the total (1.1) and (1.2), let the
vector $\hat{T}_h = (t_{1h}, t_{2h}, \ldots, t_{kh})$ represent the totals of
k variables $(x_1, x_2, \ldots, x_k)$ for the hth stratum. An estimator of the
total $\hat{T}_h$ and its variance-covariance matrix $V(\hat{T}_h)$ can be
obtained using formulae similar to (1.3) and (1.4). Further, let the vector
$\hat{T}$ denote the sum of the vectors $\hat{T}_h$. Since the sampling within
one stratum is independent of sampling within another, it follows that

$$\hat{T} = \sum_{h=1}^{H} \hat{T}_h , \tag{1.5}$$

and an estimate of the variance-covariance matrix of $\hat{T}$ is

$$V(\hat{T}) = \sum_{h=1}^{H} V(\hat{T}_h) . \tag{1.6}$$

If a large number of strata³ are involved and it is assumed that the
first two moments of the distributions of $\hat{T}_h$ (h = 1, 2, ..., H)
satisfy certain convergence properties (e.g., Lindeberg conditions), a general
form of the central limit theorem would apply (Feller, 1966); hence, the
limiting distribution of $\hat{T}$ would be multivariate normal.

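If one is interested in estimating the variance of a statistic,
$\hat\theta$, which is a nonlinear function of $\hat{T}$, then the approximate
normality of $\hat{T}$ is not necessarily useful in estimating
$V(\hat\theta)$. Examples of such nonlinear functions are:

$$\hat\theta_1 = \frac{\sum w_{hi} x_{hi}}{\sum w_{hi}}$$

$$\hat\theta_2 = \frac{\sum w_{hi} (x_{hi} - \bar{x}_w)(y_{hi} - \bar{y}_w)}{\sqrt{\sum w_{hi} (x_{hi} - \bar{x}_w)^2 \, \sum w_{hi} (y_{hi} - \bar{y}_w)^2}}$$

The statistics $\hat\theta_1$ and $\hat\theta_2$ can be readily recognized as
the weighted mean of x, and the weighted correlation between x and y,
respectively, where $w_{hi}$ represents the weight and $\bar{x}_w$ and
$\bar{y}_w$ denote the weighted means of x and y.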
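To make the two nonlinear statistics concrete, the weighted mean and weighted
correlation can be computed as follows. This is a sketch using the standard
textbook definitions; the function names and the small data set are
hypothetical, and the weights stand in for sampling weights.

```python
import math


def weighted_mean(values, weights):
    """Weighted mean: sum of w * x divided by the sum of the weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def weighted_correlation(x, y, weights):
    """Weighted correlation between x and y about their weighted means."""
    x_bar = weighted_mean(x, weights)
    y_bar = weighted_mean(y, weights)
    s_xy = sum(w * (a - x_bar) * (b - y_bar) for a, b, w in zip(x, y, weights))
    s_xx = sum(w * (a - x_bar) ** 2 for a, w in zip(x, weights))
    s_yy = sum(w * (b - y_bar) ** 2 for b, w in zip(y, weights))
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical observations with unequal weights.
x_values = [1.0, 2.0, 3.0]
y_values = [2.0, 4.0, 6.0]
w_values = [1.0, 1.0, 2.0]
mean_x = weighted_mean(x_values, w_values)
corr_xy = weighted_correlation(x_values, y_values, w_values)
```

Both statistics are ratios of weighted totals, which is why neither has a
simple exact variance formula even when the totals themselves do.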



³ If sampling of Primary Sampling Units (PSUs) is with replacement, the same
arguments can be made at PSU levels.



C. Variance Approximation Procedures

As stated previously, four relatively common approaches to appropriate
variance estimation are (1) Taylorized deviations, (2) independent
replications, (3) pseudo-replications, and (4) Jackknife. Brief descriptions
of each of these techniques, their assumptions, and their strengths and
weaknesses are provided in this section.

1. Taylorized Deviations

A classical solution to the estimation problem has been to express
the statistic $\hat\theta$ as a polynomial in $(t_1, t_2, \ldots, t_k)$,
elements of the vector $\hat{T}$, using the Taylor Series expansion. The
approximate variance of $\hat\theta$ can then be obtained by using only the
linear terms of this expansion (see Kendall and Stuart, 1973).

If $(\partial\theta/\partial T)$ is a row vector of derivatives,
$\{\partial\theta/\partial t_1, \partial\theta/\partial t_2, \ldots, \partial\theta/\partial t_k\}$,
then the approximate variance of $\hat\theta$ is estimated by

$$V(\hat\theta) = (\partial\theta/\partial T) \, V(\hat{T}) \, (\partial\theta/\partial T)' ,$$

which can be further expanded as

$$V(\hat\theta) = \sum_{h=1}^{H} (\partial\theta/\partial T_h) \, V(\hat{T}_h) \, (\partial\theta/\partial T_h)' .$$

For large values of H, it is assumed that the distribution of $\hat\theta$
will be approximately normal with variance $V(\hat\theta)$. Such expansion for
ratio estimates is presented in most textbooks on sample surveys. The
first-order Taylor Series expansion for regression coefficients has been
derived by Folsom (1974) and Fuller (1974). Woodruff (1971) has presented an
algorithm for obtaining a first-order Taylor Series approximation to compute
the variance of any complex statistic. Programs for Taylorized deviations are
available from Hidiroglou and Fuller (1975), Holt (1977), Kish et al. (1972),
Shah (1974), and Woodruff and Causey (1976).
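As an illustration of the linearization idea for the simplest nonlinear case,
a ratio of two estimated totals (such as a weighted mean), the sketch below
applies the derivatives of the ratio to each PSU's totals and then uses the
with-replacement stratum variance form. It is not the program described in the
appendixes; the function name `linearized_ratio_variance`, the input layout,
and the data are assumptions made for this example.

```python
def linearized_ratio_variance(strata):
    """First-order Taylor variance of theta = T_x / T_w.

    `strata` is a list of strata; each stratum is a list of (t_x, t_w)
    pairs holding one PSU's weighted totals (hypothetical input layout).
    """
    total_x = sum(t_x for psus in strata for t_x, _ in psus)
    total_w = sum(t_w for psus in strata for _, t_w in psus)
    theta = total_x / total_w
    variance = 0.0
    for psus in strata:
        m = len(psus)
        # Linearized PSU value: d(theta)/dT_x = 1/T_w, d(theta)/dT_w = -theta/T_w.
        z = [(t_x - theta * t_w) / total_w for t_x, t_w in psus]
        z_bar = sum(z) / m
        variance += m * sum((v - z_bar) ** 2 for v in z) / (m - 1)
    return theta, variance


# Hypothetical design: two strata with two PSUs each.
example_psus = [[(10.0, 2.0), (14.0, 2.0)], [(9.0, 3.0), (15.0, 3.0)]]
theta_hat, theta_variance = linearized_ratio_variance(example_psus)
```

Because the linearized values are formed once per PSU, the extra work beyond
computing the ratio itself is a single additional pass over the PSU totals.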
2. Independent Replications

The most straightforward way to avoid assumptions would be to draw
several independent samples from the same population and, thus, to obtain
several independent estimates of the same statistic $\theta$ (i.e.,
$\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_r$). The mean estimate,
$\bar\theta$, would be

$$\bar\theta = \sum_{i=1}^{r} \hat\theta_i / r ,$$

and an estimate of the variance, $V(\bar\theta)$, is given by

$$V(\bar\theta) = \sum_{i=1}^{r} (\hat\theta_i - \bar\theta)^2 / r(r - 1) .$$

In practice, however, one would like to compute $\hat\theta$ using data from
all the samples, and for complex statistics $\hat\theta$ will not necessarily
be equal to $\bar\theta$. It is then necessary to assume that $V(\hat\theta)$
is approximately equal to $V(\bar\theta)$.

A practical problem exists with this technique, in that it places severe
restrictions on the sample design, since each independent sample is much
smaller than the "total sample" feasible with limited resources. Further,
resources may similarly constrain the number of independent replications
(samples) to be small; consequently, the estimate of the variance would have
few degrees of freedom and would tend to be unstable. Additionally, in the
case of multivariate analysis where $\theta$ is a vector of dimension p, if
p > r, then the estimated variance-covariance matrix $V(\hat\theta)$ will be
singular.
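The replication estimate of variance described above amounts to the usual
variance of a mean applied to the replicate estimates. A minimal sketch
follows; the function name and the four replicate values are hypothetical.

```python
def replication_variance(estimates):
    """Mean of r independent replicate estimates and the variance of that mean."""
    r = len(estimates)
    theta_bar = sum(estimates) / r
    variance = sum((e - theta_bar) ** 2 for e in estimates) / (r * (r - 1))
    return theta_bar, variance


# Hypothetical estimates from four independent replicate samples.
theta_bar, var_theta_bar = replication_variance([1.0, 2.0, 3.0, 4.0])
```

With only four replicates the variance estimate rests on three degrees of
freedom, which illustrates the instability noted in the text.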

3. Pseudo-Replications

An ingenious but simple approach was suggested by McCarthy (1966)
for designs with exactly two primary sampling units (PSUs) per stratum. A
random half of the sample is defined by randomly selecting one of the PSUs in
each stratum; the half sample and its complement are assumed to be
"approximately" independent samples. Thus, an estimate of the variance with
one degree of freedom can be computed using two half samples. Of course, it is
necessary to assume that the variance of the statistic based on the total
sample is approximately half that of the estimate based on half replicates.
Since there are $2^H$ possible half samples, many pairs of half samples can be
selected. In practice, about 40 to 100 pairs of half samples are selected to
provide reasonable estimates of the variances.

The determination of the approximate degrees of freedom for the estimated
variance remains an unanswered question. The practical approach is to assume
degrees of freedom equal to the number of strata or the number of pairs of
half replicates, whichever is smaller. If both of these are large (i.e.,
greater than 30), then, in practice, the actual value is irrelevant since the
t or F distributions can be approximated by the normal or $\chi^2$
distributions, respectively.
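The half-sample idea can be sketched for a design with two PSUs per stratum.
Each selection picks one PSU per stratum; the half sample and its complement
each yield an estimate, and one quarter of their squared difference is a
one-degree-of-freedom variance estimate for their average. The sketch below
averages that quantity over a supplied set of selections. The function names,
the data, and the choice to enumerate every possible selection (feasible only
for a handful of strata) are assumptions made for the example; it does not
implement a balanced selection scheme.

```python
from itertools import product


def half_sample_variance(strata, estimate, selections):
    """Average one-degree-of-freedom variance estimate over half-sample pairs.

    `strata` is a list of (psu_a, psu_b) value pairs, `estimate` maps a list
    holding one PSU value per stratum to an estimate, and `selections` is an
    iterable of 0/1 tuples naming the PSU taken in each stratum.
    """
    squared = []
    for pick in selections:
        half = [pair[k] for pair, k in zip(strata, pick)]
        complement = [pair[1 - k] for pair, k in zip(strata, pick)]
        diff = estimate(half) - estimate(complement)
        squared.append(diff * diff / 4.0)
    return sum(squared) / len(squared)


def half_total(values):
    """Estimate the full total from one PSU per stratum by doubling."""
    return 2.0 * sum(values)


# Hypothetical PSU contributions for three strata; all 2**3 selections are used.
example_pairs = [(3.0, 5.0), (10.0, 4.0), (7.0, 8.0)]
all_picks = list(product((0, 1), repeat=len(example_pairs)))
v_half = half_sample_variance(example_pairs, half_total, all_picks)
```

For a linear statistic such as this total, averaging over every selection
reproduces the sum of squared within-stratum differences, which is the usual
two-PSU variance estimate; for nonlinear statistics the two need not agree.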

4. Jackknife

The "Jackknife" approach, originally suggested by Quenouille (1956)
and so named by Tukey (1958), is an intuitive approach to computing variances.
A definition of "Jackknife" for a multistage survey design in which all stages
are random is presented by Folsom et al. (1971). Kish and Frankel (1974) have
suggested an approach for a stratified sample with two PSUs per stratum;
however, no general definition is available for a stratified multistage sample.


D. Additional Considerations

1. Computation

Most of the widely used statistical packages (e.g., SPSS, BMDP,
OSIRIS, SAS) do not routinely provide for computing proper variances of a
weighted statistic from a multistage sample survey. Except at institutions
with large statistical and computational resources, the computation of such
standard errors frequently is not attempted.

Frequent complaints are that the cost of computing variances is excessive
and that standard software for the computation is not available (the cost of
special purpose programming being prohibitively expensive). For example, the
cost of computing the variance of a weighted mean may be 10 to 50 times that
of computing the mean. While this may be the case for some techniques or
programs, RTI's experience in using the Taylorized deviation approach is that
the total cost of computing variances is only about twice that of computing
only the mean. Moreover, several general-purpose programs have become
available recently (see subsection II.C.1, above).

2. Estimating Variance Components

Many surveys are conducted periodically, and there is a need for
evaluation of survey designs used with a view to possible improvements in
subsequently designing similar surveys. To make decisions about such designs,
there is a need to estimate contributions to the variance of a statistic from
various elements of the overall design such as stratum, PSU, and individual;
in other words, estimation of variance components is required. Of the
techniques discussed above, Taylorized deviation is the only one that permits
estimation of variance components (see Shah et al., 1973; Moore et al., 1974).
Since the estimator is expressed as a sum of random variables, the variance
components of $\hat\theta$ can be estimated in the same manner as that of
$\hat{T}$.

E. Comparison of Techniques

To compare the techniques, the following criteria are used:

1) validity or number of assumptions required,

2) restrictions on sample design,

3) computational problems for large data sets, and

4) flexibility of applications.

A summary of the comparison is presented in Table 1. From the comparison, the
Taylorized deviation approach appears to be best, if one is willing to accept
applicability of the central limit theorem. Furthermore, if one needs to
evaluate components of variance, then Taylorized deviation is the only
approach. If there are only two PSUs per stratum in the design,
pseudo-replications would be appropriate.⁴ The Jackknife approach should be
considered only in the rare case of a complex design and for a statistic for
which it is not possible to evaluate derivatives. The independent replications
approach will be suitable only if the sample is designed appropriately.



Conclusions 



The recommendation supported by the discussion in this section is that
for most nontrivial survey designs, the Horvitz-Thompson estimator and a
Taylorized deviations approach are typically the most appropriate and practical
techniques for computing parameter estimates and associated estimates of the
variance, including estimates of regression coefficients. The choice of the
Horvitz-Thompson estimator is based partially on intuitive grounds but is also
supported theoretically (Shah, 1980). The choice of Taylorized deviations was
made for the following reasons: (1) applicability to all designs and
statistics; (2) applicability to large samples; (3) economy and computational
feasibility; and (4) capacity for estimating variance components.

⁴ Although the original NLS survey design had two schools per stratum, the
ultimate design had several strata with three or four schools.

Table 1. Summary of comparative evaluation

                                              Criteria
                  ------------------------------------------------------------------
                                    Restrictions on  Computational
Technique         Assumptions       sample design    problems       Flexibility
----------------  ----------------  ---------------  -------------  ----------------
Independent       Minimal           Severe           Simple
replications

Pseudo-           Independence of   Two PSUs per     Significant
replications      complementary     stratum
                  half replicates

Taylorized        General central   None             Not difficult  Can be used
deviations        limit theorem                                     for variance
                                                                    components

Jackknife         Intuition         None             Greater than   May be useful
                                                     Taylorized     for some
                                                     deviation      designs

The assumption underlying the Taylorized deviations approach is asymptotic
normality. The assumption of approximate normality is in use in other contexts,
and some rules of thumb have been developed (e.g., a binomial distribution is
approximately normal if npq is greater than 10). There is an obvious need for
developing such simple rules of thumb for statistics resulting from survey
samples; however, until more information is available, the suggested approach
is Taylorized deviations using any of the available programs (Hidiroglou
et al., 1975; Holt, 1977; Shah, 1974; or Woodruff and Causey, 1976). In
practice, one should consider certain transformations of statistics that
rapidly converge to normality; as an example, if r is the sample correlation,
then evaluation of the variance of $\tanh^{-1}(r)$ may be more appropriate.
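The suggestion about the inverse hyperbolic tangent of a correlation can be
illustrated with a first-order (delta method) calculation: if an estimate of
the variance of r is available, the variance of the transformed value follows
from the derivative 1/(1 - r^2), and an interval built on the transformed
scale can be mapped back with tanh. The sketch below assumes a normal critical
value of 1.96 and uses a hypothetical correlation and variance; the function
name is also an assumption.

```python
import math


def fisher_z_interval(r, var_r, z_crit=1.96):
    """Transform r to z = atanh(r), propagate its variance, and back-transform.

    `var_r` is an estimate of the variance of r (for instance from a
    linearization); `z_crit` is an assumed normal critical value.
    """
    z = math.atanh(r)
    # Delta method: d/dr atanh(r) = 1 / (1 - r^2).
    var_z = var_r / (1.0 - r * r) ** 2
    half_width = z_crit * math.sqrt(var_z)
    interval = (math.tanh(z - half_width), math.tanh(z + half_width))
    return z, var_z, interval


# Hypothetical sample correlation and variance estimate.
z_value, z_variance, r_interval = fisher_z_interval(0.5, 0.01)
```

The back-transformed interval always stays inside (-1, 1), which an interval
formed directly on r does not guarantee.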

A development of the Taylorized deviation approach for regression
coefficients is provided in Appendixes A and B, for the interested reader. A
flexible and easily used computer program applying Taylorized deviations to
the computation of regression coefficients and their standard errors for data
arising from multistage samples is described in Appendixes C and D. This
program is available from the senior author of this report.
III. EMPIRICAL TESTS OF TWO REGRESSION APPROACHES

Previous discussion has indicated the theoretical superiority, under
assumptions of approximate normality, of a combined Horvitz-Thompson estimator
and a Taylorized deviation variance approximation approach to the
investigation of regression models with data arising from complex survey
samples. Nonetheless, there is need for some empirical evidence of the verity
of the approach; consequently, a simulation procedure was undertaken, using
the NLS data base. The study involved drawing a large number of random samples
from a finite population and then deriving pertinent statistics from these
samples, to evaluate the distribution of the regression coefficients and that
of the approximate F values computed by Taylorized deviations.

The simulation also allowed a natural vehicle for evaluation of other,
less appropriate, approaches to regression analysis as compared to the
suggested approach. One of the most widely used approaches to regression
analysis ignores the sample design and addresses the data as though they arose
from a simple random sample. This approach, using Ordinary Least Squares (OLS)
criteria, owes much of its popularity to the facts that it is better known
than the more appropriate techniques and that it is easily applied through all
of the widely used statistical analysis packages. Nathan and Holt (1980) have
demonstrated that in most cases the regression coefficients computed by
applying OLS solutions to data collected from complex survey designs will be
biased, although an exception occurs for epsem designs. Moreover, they showed
that, under these conditions, the OLS variance estimator is consistently
biased even in those cases for which the OLS regression coefficient estimates
themselves are unbiased. While the proper application of sampling weights,
within some standard statistical packages, can produce unbiased estimates of
the regression coefficients, the weighted variance estimate produced by most
packages remains biased. Moreover, RTI's experience with this latter approach
suggests that resulting variances show considerably greater bias than those
obtained through OLS. For these reasons, OLS was chosen as the comparison
approach to Taylor Series Linearization (TSL).

A. Method

The NLS base-year sample was taken as the finite population for this
simulation. The original sample design consisted of 600 strata, with 2
schools selected per stratum. Within each school, an equal probability sample
of roughly 18 seniors was selected. For the simulation, 84 major strata were
formed by combining similar strata, each containing at least 10 schools. So
as not to confound results of the simulation study with problems of missing
data, the finite population was defined to exclude students with missing data
elements for any of the variables used in the regression models. Consequently,
the study population contained 935 schools and 10,657 students.

The simulation consisted of selecting 1,000 random samples from the
defined population. Each sample was selected in two stages. Within each
stratum, two schools were selected without replacement with probabilities
proportional to estimated senior class enrollments; Durbin's (1967) method was
used for these selections. From each of the schools in each sample, five
responding students were selected with equal probabilities and without
replacement; thus, each resulting sample consisted of 840 students. OLS and TSL
values of regression coefficients and their associated variances, covariances,
and F values were computed for each of the 1,000 samples and for 4 regression
equations.
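The structure of one simulated draw (84 strata, 2 schools per stratum, 5
students per school, hence 84 x 2 x 5 = 840 students) can be sketched as
follows. This is a rough illustration only: the population is synthetic, the
names are hypothetical, and the school selection uses simple successive
size-weighted draws rather than Durbin's (1967) method, so the inclusion
probabilities here are not exactly proportional to size.

```python
import random


def draw_two_stage_sample(population, rng, schools_per_stratum=2,
                          students_per_school=5):
    """Draw schools by successive size-weighted selection, then students.

    `population` maps stratum -> {school: {"size": float, "students": list}}.
    Successive weighted draws are a simplification, not Durbin's method.
    """
    sample = []
    for stratum, schools in population.items():
        remaining = dict(schools)
        for _ in range(schools_per_stratum):
            names = list(remaining)
            sizes = [remaining[name]["size"] for name in names]
            chosen = rng.choices(names, weights=sizes, k=1)[0]
            # Equal-probability sample of students without replacement.
            students = rng.sample(remaining[chosen]["students"],
                                  students_per_school)
            sample.extend((stratum, chosen, s) for s in students)
            # Remove the school so it cannot be selected twice.
            del remaining[chosen]
    return sample


# Synthetic population: 84 strata, 10 schools each, 8 students per school.
rng = random.Random(1972)
population = {
    h: {
        (h, i): {"size": rng.uniform(50, 500), "students": list(range(8))}
        for i in range(10)
    }
    for h in range(84)
}
sample = draw_two_stage_sample(population, rng)
```

Repeating such a draw many times and recomputing the statistics of interest on
each sample is the general pattern of the simulation described above.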

Two basic regression models were selected for the NLS simulation study,
and within each model two related criterion variables were used to indicate
the type of postsecondary education being received by individuals in the fall
of 1974. This resulted in four regression equations for evaluation, although
the two criterion variables for each model were similar (both related to type
of postsecondary entry, but one was a dichotomization of the other). The
predictor variables represented characteristics of the high school seniors
prior to graduation in 1972. The two basic regression models are written
symbolically below.

     Model 1: INC (or TYP) = INT + SEX + SES + GRADES + GOALS.
     Model 2: INC (or TYP) = INT + SEX + SES + ABIL + RACE*PROG.

The variables used in the two models are defined below.

     INC = 1 if the individual had enrolled in some type of education
             after high school;
           0 otherwise.
     TYP = type of college enrollment, scaled as follows:
           4 if 4-year college;
           3 if 2-year college;
           2 if any other regular or vocational college;
           1 if not enrolled in any college.
     INT = the model intercept.
     SEX = 0 if female;
           1 if male.
     SES = composite socioeconomic-educational status derived from several
           base-year questionnaire items (see Dunteman, et al., 1974).
     GRADES = self-reported, overall, high-school grade range (8 levels).
     GOALS = a quantitative measure of educational or other aspirations
           derived from base-year questionnaire responses (see Dunteman,
           et al., 1974).
     ABIL = an ability score based on test items administered during the
           base-year (see Dunteman, et al., 1974).
     RACE*PROG = indicator variables for the joint contribution of
           race/ethnicity and high school program including their
           interaction, where
           R*P_1 = 1 if the high school program was academic and
                     race/ethnicity was majority white,
                   0 otherwise;
           R*P_2 = 1 if the high school program was academic and
                     race/ethnicity was any minority,
                   0 otherwise;
           R*P_3 = 1 if the high school program was nonacademic and the
                     race/ethnicity was majority white,
                   0 otherwise;
           R*P_4 = 1 if the high school program was nonacademic and the
                     race/ethnicity was any minority,
                   0 otherwise.

The regression models were evaluated using both OLS and TSL, as applied
through procedure SURREGR described in Appendixes C and D. In full model
(ALL) hypotheses, the intercept was excluded. Also, the RACE*PROG hypothesis
of model 2 was reduced to rank 3 by eliminating the R*P4 variable. The variance
of the regression coefficients, the mean difference from the corresponding
population value, and the standard error of the mean were computed over the
1,000 samples in order to evaluate the possible bias in estimates of the
coefficients. Means of the estimated variances were also computed. Additionally,
the values of the F statistic for testing the hypothesis that the regression
coefficients were equal to known population values were computed. Since the
null hypothesis is true, the observed values should resemble the theoretical F
distribution. The actual numbers of observed F values falling below various
percentile points of appropriate F distributions were tabulated for this
comparison.
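
The summary quantities described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function names `summarize_replicates` and `count_below` are hypothetical, the percentile cutoffs of the F distribution are assumed to be supplied by the caller, and the standard error of the mean is assumed to be the square root of the between-sample variance divided by the number of samples.

```python
import math
import statistics


def summarize_replicates(estimates, estimated_variances, population_value):
    """Summarize one regression coefficient over repeated samples.

    estimates           -- the coefficient computed in each sample
    estimated_variances -- the variance estimate computed in each sample
    population_value    -- the known population value of the coefficient
    """
    n = len(estimates)
    mean_diff = statistics.fmean(b - population_value for b in estimates)
    var_est = statistics.variance(estimates)      # variance over samples
    se_mean = math.sqrt(var_est / n)              # assumed form of the SE of the mean
    mean_var = statistics.fmean(estimated_variances)
    return {
        "mean_difference": mean_diff,
        "se_of_mean": se_mean,
        "variance_of_estimates": var_est,
        "mean_of_estimated_variances": mean_var,
        # True when 0 lies within three standard errors of the mean difference.
        "zero_within_3_se": abs(mean_diff) <= 3 * se_mean,
    }


def count_below(f_values, cutoffs):
    """Count the observed F values falling below each supplied cutoff."""
    return {c: sum(1 for f in f_values if f < c) for c in cutoffs}
```

Comparing `variance_of_estimates` with `mean_of_estimated_variances` corresponds to the check on the variance estimator, and `count_below` corresponds to the tabulation of F values against percentile points.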




B. Results



The discussion in the previous section leads to three predictions which
should be observable from the results of simulation, if the estimators and
their variances are unbiased.

1) The expected value of the difference of each regression coefficient
   from its true value over all samples should be approximately 0;
   therefore, the mean value over all samples of a regression coefficient
   should fall within the interval defined by the true value ± 3
   times its standard deviation.
2) The expected value of the variance of a regression coefficient which
   was computed by the Taylorized deviation method should be approximately
   equal to the variance of that regression coefficient over all
   samples.
3) The percentage over all samples of the statistically significant F
   values for testing a hypothesis about the difference of computed
   coefficients from known population values should be approximately
   equal to the nominal significance level.

Summary statistics to check the validity of predictions 1 and 2 are
presented in Tables 2 and 3 for the TSL and OLS approaches, respectively.
The first two columns of each table define the four regression equations
examined (i.e., the criterion variable and predictor variables of the two
basic models, respectively). The entries in column 3 give, for each predictor
variable, the average (over the 1,000 samples) of the difference between the
estimated regression coefficients and the actual population value of the
coefficient. The estimated standard errors of these mean differences are
given in column 4. The variances of the estimated regression coefficients
over the 1,000 samples are provided in column 5, and the averages over the
1,000 samples of the variance estimates computed for each sample are given in
column 6.

Table 2. Statistics describing the distribution of the estimated
         regression coefficients over 1,000 samples from a finite
         population using Taylor Series Linearization

Table 3. Statistics describing the distribution of the estimated
         regression coefficients over 1,000 samples from a finite
         population using Ordinary Least Squares

Column headings of Tables 2 and 3: criterion variable; predictor variable;
mean difference from population; standard error of the mean; variance of
the computed coefficients; mean of the computed variances. Legible entries
(mean difference, standard error of the mean): -0.025980, 0.001963;
-0.004871, 0.000834; -0.004038, 0.000607.

Prediction 1 can be examined from the entries in columns 3 and 4 of Tables 2
and 3. The mean differences of computed and actual regression coefficient
values are clustered near 0, ranging from -.064 to .027 for the Horvitz-Thompson
and Taylorized deviation approach and from -.075 to .004 for the OLS approach.
With few exceptions, however, the confidence intervals of three standard
errors about these means did not include the value of 0, which implies some
bias in estimating the regression coefficients by both of the approaches.

Prediction 2 can be examined from the entries in columns 5 and 6 of
Tables 2 and 3. The TSL variance for each sample was computed according to
equation (A.32) as provided in Appendix A. The average, over samples, of the
TSL variance estimates is quite comparable to the actual variance, over the
1,000 samples, of the computed regression coefficients (Table 2). Similar
results are also observable for the analogous OLS statistics (Table 3).

Summary statistics to check the validity of prediction 3 are provided in
Tables 4 and 5. These tables indicate, for regression models 1 and 2,
respectively, a comparison of the upper tail of the appropriate theoretical
F distribution to the empirical distribution of F values computed for each
hypothesis in each of the 1,000 simulations. Within each of these tables,
results are presented separately for each of the criterion variables
considered in the particular model and for TSL and OLS approaches.

The TSL solutions appear to give good approximations for both models and
for both criterion variables. Using an average of the empirical distributions
over the various hypothesis tests within model and criterion variable, the TSL
solutions can be seen to approximate the theoretical percentage points quite
well. With one exception, such averages differ from nominal values by no more
than one-half of a percentage point, and all such differences are in a
conservative direction (i.e., suggest the null hypothesis would have been
rejected less frequently than suggested by the nominal significance level).
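
The comparison of empirical and nominal tail percentages can be sketched as follows. The function name `tail_comparison` and the counts used in the usage note are hypothetical illustrations, not values from Tables 4 and 5; the sketch assumes the count of F values below a given percentile point has already been tabulated.

```python
def tail_comparison(count_below_cutoff, n_samples, nominal_percentile):
    """Compare the empirical upper-tail percentage with its nominal value.

    count_below_cutoff -- number of sample F values below the percentile point
    n_samples          -- total number of simulated samples
    nominal_percentile -- e.g. 95.0 for the 95th percentile of the F distribution
    """
    empirical_upper = 100.0 * (n_samples - count_below_cutoff) / n_samples
    nominal_upper = 100.0 - nominal_percentile
    difference = empirical_upper - nominal_upper
    if difference < 0:
        direction = "conservative"      # rejects less often than nominal
    elif difference > 0:
        direction = "nonconservative"   # rejects more often than nominal
    else:
        direction = "exact"
    return empirical_upper, difference, direction
```

For instance, with a hypothetical count of 955 of 1,000 F values below the 95th percentile point, the empirical upper tail is 4.5 percent, one-half of a percentage point below the nominal 5 percent, which the sketch labels conservative.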

In general, the OLS solutions also provide good approximations to the
theoretical F distributions. The average of empirical distributions suggests
that OLS solutions tend to err in a nonconservative direction and that the
error is greater in modeling the criterion variable TYP. Even though the
average differences from the theoretical distribution are still relatively
small in an absolute sense (at most 3.5 percentage points), a question is
raised as to the extent to which the applicability of OLS approximations is
situational (see particularly the poor fit for SES in the OLS solution for TYP).

Table 4. The number of sample F values (out of 1,000) which fell below
         specified percentiles of the appropriate F distribution for
         Model 1

Table 5. The number of sample F values (out of 1,000) which fell below
         specified percentiles of the appropriate F distribution for
         Model 2

C. Additional Simulations

Although not specific to the NLS data, similar simulations (Shah, et al.,
1977) compared TSL and OLS under a wide range of sample situations with
different populations defined from the Health and Nutrition Examination Survey
(cf. Public Health Service, 1973). In general, these results further support
the contention that the agreement of OLS solutions with the theoretical F
distribution is situational. Using 24 strata and selecting, first, 2 of 12
PSUs per stratum and, second, cluster sizes of 10 from each PSU, regression
models similar to those used in the NLS simulations were employed. As with
the NLS simulations, TSL solutions, in general, gave only marginally better
results than OLS; however, the performance of the OLS statistic was again
generally nonconservative and was better for continuous variables and for the
case of greater between-PSU homogeneity (which reduces clustering effects).
The performance of the TSL statistic was again relatively stable over all
conditions.

In the special case of the application of regression models to solutions
for domain means, the overall superiority of the TSL statistic was quite
pronounced.⁵ Under these conditions, not only did the OLS statistic generally
show considerably less congruence than TSL to the theoretical F distribution
but also the congruence of the OLS statistic varied dramatically with different
cluster sizes, different strata definitions, different between-PSU heterogeneity,
and different models. With the exception of the prediction of domain means
for race/ethnicity with a small number (8) of defined strata, the TSL F statistic
was quite consonant with the theoretical F distribution and otherwise showed
little variability from situation to situation.

⁵ Although this is a fairly atypical application of regression modeling, it
does demonstrate the general applicability of the TSL approach over a wide
variety of situations and the lack of such applicability for the OLS approach.
For a technique of adjusting standard errors for domain means computed from
NLS data, see Williams, 1978.

D. Concluding Remarks

It should be recognized that the results of the several simulations offer
only support and not definitive proof of the general applicability of the TSL
approach. No amount of empirical data can conclusively prove that any statistic
provides valid inference in general. The simulations with NLS data were
quite limited, restricted to two relatively simple (and related) regression
models crossed with two (related) criterion variables. The consideration of
results from the additional simulations provides a somewhat broader but still
limited base for conclusions. The additional simulations examined only three
regression equations (for each of two defined sampling frames) of a form
similar to those examined in the NLS simulations. The additional results also
included 64 simulations of the special case of applying regression modeling to
computation of domain means. While these latter results are certainly germane
to the general applicability of a statistical procedure, they do not directly
address more conventional regression approaches.

Also, it should be recognized that the actual NLS data base differs from
the NLS simulation in some important ways. While the NLS simulation results
were based on a cluster size of 5, the actual NLS cluster size is 17. Other
things being equal, increased cluster size tends to increase the variability
of statistics. As an example, for a simple statistic (note that a regression
coefficient is not a simple statistic), the impact of cluster size, $m$, on the
variance, $\sigma^2$, can be indicated by the straightforward equation

$$\sigma_c^2 = \{1 + (m-1)\rho\}\,\sigma^2 ,$$

where $\sigma_c^2$ is the variance including the clustering effect and $\rho$ is
the intraclass correlation. Thus, the clustering effect for simple statistics,
$(m-1)\rho$, would be expected to be four times larger with the actual NLS data
than in the simulations (i.e., (17-1)/(5-1) = 4), with equivalent values of
$\rho$. Further, even if $\rho$ were as small as .02, an increase in actual
variance of 32 percent (i.e., 16(.02)) over that of random sampling would be
expected for simple statistics, other things being equal.
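
The arithmetic in this paragraph can be checked with a short sketch. The function name `variance_inflation` is a hypothetical label for the factor $1 + (m-1)\rho$; the cluster sizes 5 and 17 and the value $\rho = .02$ are taken from the text.

```python
def variance_inflation(cluster_size, rho):
    """Factor {1 + (m - 1) * rho} applied to the simple random sampling variance."""
    return 1.0 + (cluster_size - 1) * rho


# Ratio of clustering effects (m - 1) * rho for the actual data versus the simulation.
ratio_of_effects = (17 - 1) / (5 - 1)

# Proportional increase in variance for the actual cluster size when rho = .02.
increase_actual = variance_inflation(17, 0.02) - 1.0
increase_simulated = variance_inflation(5, 0.02) - 1.0
```

Under these inputs the clustering effect is 0.32 for a cluster size of 17 and 0.08 for a cluster size of 5, which reproduces the fourfold ratio stated above.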

Finally, it should be recognized that the simulations were conducted
under more or less ideal conditions in regard to comparability of sample
weights, a situation that should theoretically favor OLS, which assumes equal
sampling weights. While no marked disparities of sampling weights exist for
the NLS data, some differences in weights have been introduced as a result of
oversampling and adjustments for nonresponse (recall that the simulations did
not address the problem of nonresponse). Under such conditions, bias in the
OLS estimators may be expected to increase.

With an understanding of these additional considerations as well as the
limitations of the results and drawing from results of both simulation studies,
the general findings support the following conclusions.

1) TSL, though not perfect, produces "good" conservative inferences for
   regression coefficients (i.e., the probability of rejecting the null
   hypothesis, when true, is smaller than the nominal value) when the
   number of strata is moderately large (greater than 20).
2) In some situations, the performance of TSL is less satisfactory when
   the number of strata is small (less than 10).
3) OLS produces nonconservative inferences for regression coefficients
   (i.e., the probability of rejecting the null hypothesis, when true,
   is larger than the nominal value); however, in some situations the
   extent of nonconservatism is negligible.
4) While OLS compares favorably to TSL in the specific typical regression
   models considered, there are indications that the extent of
   acceptability of the technique is situational.
5) In the special case of domain means, the results of OLS are generally
   poor. Moreover, in this situation, the performance of OLS deteriorates
   considerably when the cluster size is increased.

IV. SUMMARY AND CONCLUSIONS


The sample design for NLS is a deeply stratified two-stage design with
600 final strata. Although the original design called for 1,200 schools (2
per stratum) and approximately 21,600 students (18 per school), the final
sample, as defined from various sample additions, included 1,318 schools and
23,451 students. The information collected during the NLS base year and in
subsequent follow-up surveys represents a rich data source for addressing
questions regarding the educational and occupational development of high
school graduates. The types of statistics used to address these questions may
vary from simple totals to ratio and regression estimators; however, the
problems of drawing valid and relevant inference that are common to all
multistage sample surveys must be faced. The "perfect" answers to drawing
inferences for complex statistics from survey data may not be readily available,
but an applied scientist needs some good, though imperfect, techniques to
provide approximate quantitative measures for the errors in the estimates.

This report has reviewed available theories and has suggested a technique
that will be useful in analyzing NLS data with respect to regression models.
For drawing inferences, it is imperative that some estimate of the variance
(mean square error) of the estimator be computed from the data. For a simple
linear function (such as a total or mean) of the observations, it is possible
to derive explicit algebraic expressions for estimating variances; however, it
is not possible to obtain such explicit expressions for variance estimates of
complex estimators such as regression coefficients. The approximation procedures
considered were: (1) Taylorized deviations, (2) independent replications,
(3) balanced repeated replications, and (4) Jackknife. The Taylorized deviations
approach is preferred for the following reasons: (1) it is applicable to all
designs and statistics; (2) it provides "good" answers for "large" samples;
(3) it is economically and computationally feasible; and (4) it alone provides
for estimation of variance components.

Since the applicability of the Taylorized deviations approach is based on
asymptotic theory, its performance was evaluated empirically through simulation,
using NLS data. Additional simulations using another large data set were also
considered. Simulations were carried out using both Taylor Series Linearization
(TSL), as defined in Appendix A, and Ordinary Least Squares (OLS). Aside
from potential violations of assumptions of the relatively robust regression
model, OLS is obviously inappropriate for drawing inferences from complex
samples, assuming as it does simple random sampling; nonetheless, the technique
was considered because it is so widely known and used (even when theoretically
inappropriate) and is so easily applied through the more widely used statistical
packages.

The simulations, though limited in scope, do suggest that TSL performs
extremely well in a large variety of situations. With a small (i.e., less
than 10) number of strata, there is some deterioration in its performance in
some cases, but there is a dramatic improvement in performance with more than
20 strata. Both TSL and OLS show some bias in the estimation of regression
coefficients; however, errors in inference using TSL are generally conservative,
while the OLS approach generally yields nonconservative results (i.e., statistical
tests are likely to reject the null hypothesis more frequently than they
should). In some typical regression situations, however, the nonconservatism
of OLS is negligible, and the approach performs quite well. Nonetheless,
there are clear indications that the extent to which OLS solutions approximate
the theoretical F distribution is situational. OLS performs particularly
poorly in the specific case in which a regression model is applied to the
estimation of domain means.

The various findings do suggest some practical recommendations to those
who wish to use regression models in analysis of NLS data. In making general
recommendations for use of a statistical methodology, even for a specific
survey or specific hypothesis testing, the performance of the methodology in a
broad variety of situations is relevant. Thus, if a methodology is successful
on one or two hypotheses for a specific survey, there is no logical justification
that it will perform well for all similar hypotheses, even with the same
data. On the other hand, a methodology that is successful for several different
hypotheses and different data sets may be expected to perform reasonably well
in most situations. Moreover, a fairly general rule in applied statistics is
that, given equality in other areas, recommended statistical methodologies
which have potential for erring should err in a conservative direction.

Under these guidelines, the TSL procedure can be recommended for the NLS
data. In fact, the transformed Hotelling's $T^2$ type statistic, using the TSL
variance-covariance matrix, provides fairly robust multivariate inferences
about regression coefficients with a moderately large number of strata (i.e.,
24 or more). Although standard software for use of TSL is not widely available,
such software does exist (see Section 4l7c.l). The procedure SURREGR described
in Appendixes C and D can generally be supported on a system supporting SAS;
this procedure is available from the senior author of this report.

Although OLS yielded good results for some regression models in the
simulations, it cannot be recommended for general use on the NLS data base.
Not only is the OLS procedure logically poor when compared to TSL (OLS results
are necessarily biased when applied to the NLS design; see Nathan and Holt,
1980), but also it is nonconservative. The potential user of the NLS data
base may be tempted to use an OLS regression approach on the basis of the fact
that OLS appeared to perform reasonably well in most simulations involving
typical regression models. Such a decision would involve, of course, an
element of risk, since there is an indication that OLS does not perform equally
well for all models or designs. Moreover, the actual NLS data base differs
from the NLS simulation in some important ways. Specifically, the actual NLS
data are based on larger cluster sizes and contain more disparate sample
weights.

Even though the OLS procedure cannot be recommended for general use with
NLS data, it should be noted that the principal purpose of this research was
not to examine the robustness of OLS. Additional research obviously is needed
to determine the conditions under which OLS regression solutions might acceptably
approximate those of more appropriate approaches. Further, the recommendations
provided above have addressed the situation of drawing inferences from a
sample (i.e., estimating population parameters); however, many regression
studies are not directed to this end. For such other uses of regression with
the NLS data (e.g., sample-specific modeling, exploratory studies), the use of
OLS may be more appropriate, but such uses also are beyond the scope of this
study. In such cases, however, the potential user must recognize the clear
distinction from estimating population functions.

Appendix A

An Approximation to the Variance of
Regression Coefficients in Sample Surveys

In this appendix, the problem of estimating the variance of a vector of
regression coefficients in a complex sample is solved by first finding a
linear approximation to the estimator of the coefficients and then using this
approximation to derive an approximation for the variance.



I. THE LINEARIZATION TECHNIQUE 



The linearization technique employed in this paper is the Taylor Series
expansion of the estimator. Tepping (1968) first used this approach with
special reference to regression coefficients. Woodruff (1971) later elaborated
it for a broad class of complex sample designs. In general, let
$u = (u_1, u_2, u_3, \ldots, u_k)'$ be a vector of sample statistics and let
$U = (U_1, U_2, \ldots, U_k)'$ represent a vector of simple population
parameters such that $E[u_i] = U_i$. Let $f(U) = (f_1(U), \ldots, f_p(U))'$ be a
vector-valued function of $U$, which represents the p population parameters of
interest. Assume that $f(u)$ estimates $f(U)$.

Now, $f(u)$ is linearized by approximating it by its first-order Taylor
Series expansion:

$$f(u) \cong f(U) + \sum_{i=1}^{k} (u_i - U_i)\,\frac{\partial f(U)}{\partial U_i} \tag{A.1}$$

or

$$f(u) - f(U) \cong \sum_{i=1}^{k} (u_i - U_i)\,\frac{\partial f(U)}{\partial U_i}, \tag{A.2}$$

where

$$\frac{\partial f(U)}{\partial U_i} = \left( \frac{\partial f_1(U)}{\partial U_i}, \ldots, \frac{\partial f_p(U)}{\partial U_i} \right)'. \tag{A.3}$$

Since $E[u_i - U_i] = 0$, it can be shown that $E[f(u) - f(U)] = 0$, to the order of
approximation indicated. Consequently, the matrix form of the mean square
error, where VAR indicates a variance-covariance matrix, is

$$\mathrm{VAR}[f(u) - f(U)] = E\left[\{f(u) - f(U)\}\{f(u) - f(U)\}'\right]. \tag{A.4}$$

Using (A.2), (A.4) can be approximated by

$$\mathrm{VAR}[f(u) - f(U)] \cong E\left[\left\{\sum_{i=1}^{k} (u_i - U_i)\,\frac{\partial f(U)}{\partial U_i}\right\}\left\{\sum_{j=1}^{k} (u_j - U_j)\,\frac{\partial f(U)}{\partial U_j}\right\}'\right]. \tag{A.5}$$

Therefore,

$$\mathrm{VAR}[f(u) - f(U)] \cong \sum_{i=1}^{k} \sum_{j=1}^{k} \mathrm{COV}(u_i, u_j)\,\frac{\partial f(U)}{\partial U_i}\left\{\frac{\partial f(U)}{\partial U_j}\right\}', \tag{A.6}$$

where COV(.,.) is used to indicate the covariance of two entities.
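
As an illustration of (A.6) for small k, the sketch below evaluates the double sum for a scalar function (p = 1) of k = 2 statistics, the ratio $f(U) = U_1/U_2$. The choice of a ratio, the function names, and the numerical inputs are assumptions made only for this example; they do not come from the report.

```python
def linearized_variance(gradient, cov):
    """Double sum of (A.6) for a scalar function: sum_i sum_j COV(u_i, u_j) g_i g_j."""
    k = len(gradient)
    return sum(cov[i][j] * gradient[i] * gradient[j]
               for i in range(k) for j in range(k))


def ratio_gradient(U1, U2):
    """Partial derivatives of f(U) = U1 / U2 with respect to U1 and U2."""
    return [1.0 / U2, -U1 / U2 ** 2]


# Hypothetical population totals and covariance matrix of the two sample totals.
gradient = ratio_gradient(10.0, 5.0)          # [0.2, -0.4]
cov = [[4.0, 1.0],
       [1.0, 2.0]]
approx_variance = linearized_variance(gradient, cov)
```

With these inputs the three distinct terms are 4(0.2)², 2(1)(0.2)(-0.4), and 2(0.4)², which sum to 0.32.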

If k is small, (A.6) is a convenient expression from which the variances
of $f(u)$ may be computed; however, if k is large (greater than 3), the formula
becomes cumbersome. In this case, an alternative approach uses the actual
numerical value of the sum of the k linearized portions of (A.2) so that the
variance-covariance matrix of $f(u)$ may be evaluated directly. Explicitly,
define a new column vector, $w$, with p elements,

$$w = \sum_{i=1}^{k} (u_i - U_i)\,\frac{\partial f(U)}{\partial U_i}, \tag{A.7}$$

and observe that $E[w] = 0$. Now (A.5) can be expressed as

$$\mathrm{VAR}[f(u) - f(U)] = E[ww'] = \mathrm{VAR}[w] = \mathrm{VAR}[z], \tag{A.8}$$

where

$$z = \sum_{i=1}^{k} u_i\,\frac{\partial f(U)}{\partial U_i}. \tag{A.9}$$

II. APPLICATION OF TAYLORIZED LINEARIZATION TO REGRESSION COEFFICIENTS



A realistic regression model may be defined using the notation from
Section II of the report as

$$Y = XB + e. \tag{A.10}$$

Here $e$ represents a vector of deviations from the linear prediction equation.
Kish and Frankel's criterion minimizing the sum of the squared deviations over
the entire population yields a solution for $B$ which is the familiar least
squares solution to the normal equations:

$$B = (X'X)^{-1} X'Y. \tag{A.11}$$

Now suppose that a sample, S, is drawn from the population and let the
subscript i refer to any population member. If the units are selected with
probability, $P_i$, then the unbiased Horvitz-Thompson estimators for $X'X$ and
$X'Y$ are $x'x$ and $x'y$. (Lower case letters indicate sampling statistics.)

$$x'x = \sum_{i \in S} (X_i'X_i / P_i) \tag{A.12}$$

$$x'y = \sum_{i \in S} (X_i'Y_i / P_i) \tag{A.13}$$

The summations extend over units, i, belonging ($\in$) to the sample, S. The
availability of unbiased estimates of $X'X$ and $X'Y$ allows the estimation of $B$
with

$$b = (x'x)^{-1}(x'y). \tag{A.14}$$

From (A.11) it can be seen that $B$ is a function of $X'X$ and $X'Y$ while $b$ is
a function of $x'x$ and $x'y$ from (A.14). If it is assumed that there are p
independent variables in the model, then $X'X$ and $x'x$ are p×p symmetric matrices.
$X'Y$ and $x'y$ are p×1 matrices. Let $(X'X)_{jj'}$ or $(x'x)_{jj'}$ represent the element
of $X'X$ or $x'x$ in row j and column j'. Also let $(X'Y)_j$ or $(x'y)_j$ locate the
row j element of $X'Y$ or $x'y$. Using the results presented in the previous
section of this chapter, the Taylorized linearization of $b$ can be written as

$$b \cong B + \sum_{j=1}^{p} \left[(x'y)_j - (X'Y)_j\right] \frac{\partial B}{\partial (X'Y)_j} + \sum_{j=1}^{p} \sum_{j'=j}^{p} \left[(x'x)_{jj'} - (X'X)_{jj'}\right] \frac{\partial B}{\partial (X'X)_{jj'}}. \tag{A.15}$$

For regression coefficients, Tepping and Woodruff solved for the derivatives
numerically. However, Folsom (1974) and Fuller (1974) independently
developed an analytical expression for the derivatives, which simplifies the
expression in (A.9). The remaining sections of this chapter follow Folsom's
work. The partial derivatives are derived in Appendix B, and only the results
are presented here.

For j = 1, ..., p let $d_j$ be the p×1 column vector with a 1 in row j and
zeros in all other rows. Also define p(p+1)/2 symmetric matrices, $D_{jj'}$, with
dimension p×p and with zeros everywhere except in row j, column j' and row j',
column j. These locations contain 1's.
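
The vectors $d_j$ and matrices $D_{jj'}$ just defined can be sketched directly. The helper names below are hypothetical, and indices are zero-based in the code although the text counts rows and columns from 1. The sketch also checks a property that follows from the definition: weighting each $D_{jj'}$ (with $j \le j'$) by the corresponding element of a symmetric matrix and summing rebuilds that matrix.

```python
def unit_vector(p, j):
    """d_j: length-p vector with a 1 in position j and zeros elsewhere."""
    return [1 if r == j else 0 for r in range(p)]


def d_matrix(p, j, jp):
    """D_jj': p-by-p matrix with 1's at (j, j') and (j', j), zeros elsewhere."""
    D = [[0] * p for _ in range(p)]
    D[j][jp] = 1
    D[jp][j] = 1
    return D


def rebuild_symmetric(A):
    """Sum A[j][j'] * D_jj' over all pairs with j <= j'."""
    p = len(A)
    total = [[0] * p for _ in range(p)]
    for j in range(p):
        for jp in range(j, p):
            D = d_matrix(p, j, jp)
            for r in range(p):
                for c in range(p):
                    total[r][c] += A[j][jp] * D[r][c]
    return total


def rebuild_vector(v):
    """Sum v[j] * d_j over j, which returns the vector itself."""
    p = len(v)
    return [sum(v[j] * unit_vector(p, j)[r] for j in range(p)) for r in range(p)]
```

For p = 3 there are 3(4)/2 = 6 such matrices, three with a single 1 on the diagonal and three with a symmetric pair of off-diagonal 1's.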

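From Appendix B we have

$$\frac{\partial B}{\partial (X'Y)_j} = (X'X)^{-1} d_j \tag{A.16}$$

and

$$\frac{\partial B}{\partial (X'X)_{jj'}} = -(X'X)^{-1} \left[\frac{\partial (X'X)}{\partial (X'X)_{jj'}}\right] B \tag{A.17}$$

or

$$\frac{\partial B}{\partial (X'X)_{jj'}} = -(X'X)^{-1} D_{jj'} B. \tag{A.18}$$

Substituting (A.16) and (A.17) into (A.15) yields the approximation

$$b \cong B + (X'X)^{-1} \sum_{j=1}^{p} \left[(x'y)_j - (X'Y)_j\right] d_j - (X'X)^{-1} \sum_{j=1}^{p} \sum_{j'=j}^{p} \left[(x'x)_{jj'} - (X'X)_{jj'}\right] D_{jj'} B.$$

Based on the definition of $d_j$ and $D_{jj'}$, it can be seen that

$$\sum_{j=1}^{p} \left[(x'y)_j - (X'Y)_j\right] d_j = x'y - X'Y \tag{A.19}$$

$$\sum_{j=1}^{p} \sum_{j'=j}^{p} \left[(x'x)_{jj'} - (X'X)_{jj'}\right] D_{jj'} = x'x - X'X. \tag{A.20}$$

Consequently,

$$b \cong B + (X'X)^{-1} \left[\{x'y - X'Y\} - \{x'x - X'X\}B\right] \tag{A.21}$$

$$\phantom{b} = B + (X'X)^{-1} \left[x'y - (x'x)B + (X'X)B - X'Y\right]. \tag{A.22}$$

Using the fact that $(X'X)B = X'Y$,

$$b \cong B + (X'X)^{-1} \left[x'y - (x'x)B\right] \tag{A.23}$$

or

$$b - B \cong (X'X)^{-1} \left[x'y - (x'x)B\right]. \tag{A.24}$$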
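
The weighted estimator of (A.12) through (A.14) and the right-hand side of (A.24) can be sketched in plain Python. This is a minimal illustration, not the SURREGR implementation: the function names and the small Gauss-Jordan solver are assumptions introduced here, and each row of `X` is assumed to hold the predictor values for one sampled unit (including a leading 1 if an intercept is wanted).

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        piv = M[col][col]
        M[col] = [v / piv for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def ht_cross_products(X, Y, P):
    """x'x and x'y as in (A.12) and (A.13): sums of X_i'X_i/P_i and X_i'Y_i/P_i."""
    p = len(X[0])
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for row, y, prob in zip(X, Y, P):
        for a in range(p):
            xty[a] += row[a] * y / prob
            for c in range(p):
                xtx[a][c] += row[a] * row[c] / prob
    return xtx, xty


def ht_regression(X, Y, P):
    """b = (x'x)^{-1} (x'y) as in (A.14)."""
    xtx, xty = ht_cross_products(X, Y, P)
    return solve(xtx, xty)


def linearized_deviation(XtX_pop, xtx, xty, B):
    """Right-hand side of (A.24): (X'X)^{-1} [x'y - (x'x) B]."""
    p = len(B)
    resid = [xty[a] - sum(xtx[a][c] * B[c] for c in range(p)) for a in range(p)]
    return solve(XtX_pop, resid)
```

When the sample is the whole population with every $P_i = 1$, the sample cross products equal $X'X$ and $X'Y$, so `linearized_deviation` returns a vector of zeros, as (A.24) implies.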

III. APPLICATION TO THE STRATIFIED, TWO-STAGE SAMPLE DESIGN

For the purpose of this report, a stratified, two-stage sample design is
assumed. In this type of design, the population has been divided into H strata
by population or demographic characteristics. For stratum h (h = 1, ..., H)
there are $n(h)$ primary sampling units, PSUs. The actual observations are
nested within each PSU, and there are $n(h\ell)$ observations in PSU $\ell$
($\ell = 1, \ldots, n(h)$) within stratum h.

Referring to the first section of this chapter and remembering that the
$u_i$ are sample statistics, consider the case in which each $u_i$ is a sum over
sample observations of random values. (The regression problem represents such
a case.) Let $u_i(h\ell j)$ indicate the observation for individual j
($j = 1, \ldots, n(h\ell)$) in PSU $\ell$ within stratum h.

For the stratified, two-stage sample design,

$$u_i = \sum_{h=1}^{H} \sum_{\ell=1}^{n(h)} \sum_{j=1}^{n(h\ell)} u_i(h\ell j), \tag{A.25}$$

and now (A.9) can be rewritten as the vector

$$z = \sum_{i=1}^{k} \left\{ \sum_{h=1}^{H} \sum_{\ell=1}^{n(h)} \sum_{j=1}^{n(h\ell)} u_i(h\ell j) \right\} \frac{\partial f(U)}{\partial U_i}. \tag{A.26}$$

Rearranging the order of summation,

$$z = \sum_{h=1}^{H} \sum_{\ell=1}^{n(h)} \sum_{j=1}^{n(h\ell)} \left\{ \sum_{i=1}^{k} u_i(h\ell j)\,\frac{\partial f(U)}{\partial U_i} \right\}. \tag{A.27}$$

Consequently, another vector, $z(h\ell)$, may be defined as

$$z(h\ell) = \sum_{j=1}^{n(h\ell)} \left\{ \sum_{i=1}^{k} u_i(h\ell j)\,\frac{\partial f(U)}{\partial U_i} \right\}, \tag{A.28}$$

where

$$z = \sum_{h=1}^{H} \sum_{\ell=1}^{n(h)} z(h\ell) \tag{A.29}$$

and

$$\mathrm{VAR}[z] = \mathrm{VAR}\left[ \sum_{h=1}^{H} \sum_{\ell=1}^{n(h)} z(h\ell) \right], \tag{A.30}$$

for this sample design.

IV. THE GENERAL MEAN SQUARE ERROR FOR REGRESSION COEFFICIENTS

A biased, with-replacement approximation to the variance in (A.30) for a
stratified two-stage sample design will be used. Gray (1975) states that the
variance of a sample total from without-replacement sampling may be divided
into a with-replacement variance component and a without-replacement covariance
contribution at the first stage. By ignoring this covariance component, which
is usually negative, a conservative approximation to the variance is obtained.
It is usually assumed that this omitted finite population correction at the
first stage is small and accounts for little of the total variance. This
approximation for VAR[z] is

$$\sum_{h=1}^{H} n(h)\,S(h), \tag{A.31}$$

where

$$S(h) = \left[ \sum_{\ell=1}^{n(h)} \{z(h\ell) - \bar{z}(h)\}\{z(h\ell) - \bar{z}(h)\}' \right] / \{n(h) - 1\}, \tag{A.32}$$

and

$$\bar{z}(h) = \left[ \sum_{\ell=1}^{n(h)} z(h\ell) \right] / n(h). \tag{A.33}$$

The actual specification of the approximation for the estimate of the
variance of regression coefficients in a stratified two-stage sample design
requires the definition of the row vector $X(h\ell j)$ as the X values for
observation j in PSU $\ell$ and stratum h. Correspondingly, $Y(h\ell j)$ is the
scalar response for a particular observation. From (A.12) and (A.13),

$$x'x = \sum_{h=1}^{H} \sum_{\ell=1}^{n(h)} \sum_{j=1}^{n(h\ell)} X'(h\ell j)X(h\ell j)/P(h\ell j) \tag{A.34}$$

and

$$x'y = \sum_{h=1}^{H} \sum_{\ell=1}^{n(h)} \sum_{j=1}^{n(h\ell)} X'(h\ell j)Y(h\ell j)/P(h\ell j). \tag{A.35}$$

From the previous section of this appendix, $z(h\ell j)$ for the regression case can
be defined as

$$z(h\ell j) = (X'X)^{-1} \left[X'(h\ell j)Y(h\ell j) - X'(h\ell j)X(h\ell j)B\right] / P(h\ell j). \tag{A.36}$$

Now, the expression for $z(h\ell)$ can be written with one last level of
approximation, which is imposed by substituting the estimates $(x'x)$ and $b$ for
$(X'X)$ and $B$, respectively:

$$z(h\ell) = (x'x)^{-1} \sum_{j=1}^{n(h\ell)} \left[X'(h\ell j)\{Y(h\ell j) - X(h\ell j)b\}\right] / P(h\ell j). \tag{A.37}$$

A convenient expression for (A.37) is obtained by defining the vector

$$r(h\ell j) = \left[X'(h\ell j)\{Y(h\ell j) - X(h\ell j)b\}\right] / P(h\ell j). \tag{A.38}$$

Now (A.37) may be written as

$$z(h\ell) = (x'x)^{-1} \sum_{j=1}^{n(h\ell)} r(h\ell j). \tag{A.39}$$
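
The computation laid out in (A.31) through (A.39) can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions rather than the SURREGR procedure: the function names, the record layout `(stratum, psu, x_row, y, prob)`, and the small Gauss-Jordan solver are all introduced here for the example, and every stratum is assumed to contain at least two PSUs.

```python
from collections import defaultdict


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        piv = M[col][col]
        M[col] = [v / piv for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def tsl_variance(records):
    """Return (b, V) from records of the form (stratum, psu, x_row, y, prob)."""
    records = list(records)
    p = len(records[0][2])
    # Weighted cross products x'x and x'y, (A.34) and (A.35).
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for _, _, x, y, prob in records:
        for a in range(p):
            xty[a] += x[a] * y / prob
            for c in range(p):
                xtx[a][c] += x[a] * x[c] / prob
    b = solve(xtx, xty)

    # Sum the vectors r(hlj) of (A.38) within each PSU.
    psu_sums = defaultdict(lambda: [0.0] * p)
    for h, l, x, y, prob in records:
        resid = y - sum(x[a] * b[a] for a in range(p))
        for a in range(p):
            psu_sums[(h, l)][a] += x[a] * resid / prob

    # z(hl) of (A.39), grouped by stratum.
    z_by_stratum = defaultdict(list)
    for (h, l), r_sum in psu_sums.items():
        z_by_stratum[h].append(solve(xtx, r_sum))

    # Sum over strata of n(h) S(h), (A.31) to (A.33).
    V = [[0.0] * p for _ in range(p)]
    for h, zs in z_by_stratum.items():
        n_h = len(zs)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        zbar = [sum(z[a] for z in zs) / n_h for a in range(p)]
        for z in zs:
            d = [z[a] - zbar[a] for a in range(p)]
            for a in range(p):
                for c in range(p):
                    V[a][c] += n_h * (d[a] * d[c]) / (n_h - 1)
    return b, V
```

Only the PSU-level sums enter the variance, so within-PSU variation contributes through the spread of the $z(h\ell)$ vectors about their stratum mean. If every observation lies exactly on the fitted line, all residuals vanish and the estimated variance matrix is zero.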
Appendix B

Derivation of the Partial Derivatives



c " \3B 3B * 

• For the simplification of gg" and 8(X ' X )^ i ttf&Hot n ei&™*, 



•( 1, if j=r; A 
d (r) = \ 

J \ 0, otherwise; 

( i, if j = j'; 

^ to, otherwise. 



Also define p(p+l)/2 symmetric matrices, D. M , with dimension pxp and with < ^ 

JJ 1 ' 

zeros everywhere except in row j, column j'.and row j ! , cqlumn j. These * m 

^cations contain l ! s^ The element in row, r, and column, c, of D.^, cfcn 

be written as - . . /f 

^..,(rc) = (1 - d jjl )d j (r)d.,(c) + d.,(r)d.(c). m j 

t 

if*' * • 

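Consider the partial derivatives of B with respect to each element in X'Y
by taking the partials of both sides of the equality (X'X)B = X'Y. Writing
$d_j$ for the p×1 vector whose r-th element is $d_j(r)$,

$$\begin{aligned}
\frac{\partial (X'X)B}{\partial (X'Y)_j} &= \frac{\partial (X'Y)}{\partial (X'Y)_j}, & j &= 1, 2, \ldots, p, \\
\frac{\partial (X'X)B}{\partial (X'Y)_j} &= d_j, & j &= 1, 2, \ldots, p, \\
(X'X)\,\frac{\partial B}{\partial (X'Y)_j} &= d_j, & j &= 1, 2, \ldots, p, \\
\frac{\partial B}{\partial (X'Y)_j} &= (X'X)^{-1} d_j, & j &= 1, 2, \ldots, p.
\end{aligned}$$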

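The derivation of the partials for X'X is more complicated. Again begin
with the equality and observe that the right-hand side is equal to zero
after the derivatives with respect to each element of X'X are taken.

$$\begin{aligned}
\frac{\partial (X'X)B}{\partial (X'X)_{jj'}} &= \frac{\partial (X'Y)}{\partial (X'X)_{jj'}}, \qquad j \le j' = 1, 2, \ldots, p, \\
\left[\frac{\partial (X'X)}{\partial (X'X)_{jj'}}\right] B + (X'X)\left[\frac{\partial B}{\partial (X'X)_{jj'}}\right] &= 0, \\
(X'X)\,\frac{\partial B}{\partial (X'X)_{jj'}} &= -\left[\frac{\partial (X'X)}{\partial (X'X)_{jj'}}\right] B = -D_{jj'}\,B.
\end{aligned}$$

Consequently,

$$\frac{\partial B}{\partial (X'X)_{jj'}} = -(X'X)^{-1} D_{jj'}\,B, \qquad j \le j' = 1, 2, \ldots, p.$$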
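The result for the partials with respect to X'X can be checked numerically. The sketch below builds the matrix with ones in row j, column j' and row j', column j from the element formula given earlier in this appendix and compares the analytic partial with a finite-difference approximation for p = 2. The matrices, the step size, and the helper names `d_matrix` and `solve2` are hypothetical choices made only for this illustration.

```python
def d_matrix(j, jp, p):
    """Return the p x p matrix D_{jj'} using 1-based indices j and jp."""
    d_same = 1 if j == jp else 0
    D = [[0] * p for _ in range(p)]
    for r in range(1, p + 1):
        for c in range(1, p + 1):
            dj_r, djp_c = int(j == r), int(jp == c)
            djp_r, dj_c = int(jp == r), int(j == c)
            D[r - 1][c - 1] = (1 - d_same) * dj_r * djp_c + djp_r * dj_c
    return D

def solve2(A, v):
    """Solve the 2 x 2 system A B = v by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(v[0] * A[1][1] - A[0][1] * v[1]) / det,
            (A[0][0] * v[1] - A[1][0] * v[0]) / det]

# Hypothetical symmetric X'X and X'Y for p = 2
XtX = [[4.0, 1.0], [1.0, 3.0]]
XtY = [2.0, 5.0]
B = solve2(XtX, XtY)

# Analytic partial for the pair (j, j') = (1, 2): -(X'X)^{-1} D B
D12 = d_matrix(1, 2, 2)
DB = [sum(D12[r][c] * B[c] for c in range(2)) for r in range(2)]
analytic = [-v for v in solve2(XtX, DB)]

# Finite-difference check: perturb X'X symmetrically by eps * D
eps = 1e-6
XtX_eps = [[XtX[r][c] + eps * D12[r][c] for c in range(2)] for r in range(2)]
B_eps = solve2(XtX_eps, XtY)
numeric = [(B_eps[i] - B[i]) / eps for i in range(2)]
```

The two vectors `analytic` and `numeric` should agree to within the first-order error of the finite difference.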



Appendix C

The Survey Regression Procedure



This appendix provides a brief description of a flexible program developed
by the authors for estimating regression parameters and associated standard
errors from data arising from survey samples. The procedure employed is based
on a Taylor series linearization approach described in Appendix A. The
program, entitled SURREGR, has been incorporated into an existing statistical
analysis package, Statistical Analysis System (SAS); a User's Manual for the
program is provided in Appendix D.

I. GENERAL

One of the most difficult tasks in providing a new, flexible statistical
computer program is [...] about statistics, and statisticians and social
scientists who know little about programming to use it. Experience has shown
that statistical programs that stand alone with their own specialized control
cards [...]. For maximum utility, these programs need to operate within a
system which takes care of interfacing with the user; however, it is extremely
time-consuming to design and implement such a system. Therefore, it was
decided that the survey regression program would be written to run under an
existing statistical system.

Several statistical packages (BMDP, SPSS, OSIRIS, and SAS) were reviewed
and, among these, it was determined that SAS possessed the best data management
capabilities. The particular advantages of SAS are as follows:

a) the ability to extensively manipulate the input data,
b) the immediate availability of other types of statistical analyses,
c) free format of procedure information statements,
d) comprehensive error checking for data and procedure information
   statements,
e) procedure output as a SAS data set which is available for further
   analysis, and
f) the dynamic allocation of core which enables flexible programming
   within system core and time limitations.
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Furthermore, programming details and technical assistance were readily 
available within the local computing facility, the Triangle Universities
Computation Center (TUCC). Consequently, it was decided that the survey
regression program would be written as an SAS procedure. The SAS
documentation was provided by Barr, Goodnight, Sall, and Helwig (1976 and 1977).



II. COMPUTATIONAL PROCEDURES

The survey regression procedure, SURREGR, has five main functions:

a) interpretation of user input,
b) accumulation of sums of squares and cross products,
c) a solution for the regression coefficients,
d) general mean square errors, and
e) tests of hypothesis.

The approach taken for each function is discussed in the following subsections.

A. Interpretation of User Input

This function is controlled by the language module, which is independent
of the computational part of the program and is responsible for the parsing of
the SAS language statements. Although the language module must be written in
IBM 360 assembler language, SAS macros are provided for the standard parsing
of the variable lists, options, and parameters. The philosophy for the parsing
of the model statement is borrowed from the SAS general linear models procedure,
GLM. The GLM language module was modified to allow for multiple model statements
within one execution of SURREGR and to permit effects and interactions formed
by categorical dependent variables. For all categorical variables that are
declared as effects or interactions in the model statement, SURREGR generates
the required number of binary (0,1) variables, $q_i$. These dummy variables are
defined as

$$q_i = \begin{cases} 1, & \text{if an observation has a particular value for that variable;} \\ 0, & \text{otherwise.} \end{cases} \tag{C.1}$$

Only after all information statements are parsed without error will SAS execute
the computational part of the program.

B. Accumulation of Sums of Squares and Cross Products

The X'WX and X'WY matrices are computed as the second main function of
the procedure. These matrices are accumulated as sums of squares and cross
products of variables over all observations. In other words, the actual X, Y,
and W matrices are never formed. However, the W matrix can be represented as
a square diagonal matrix with the number of rows and columns equal to the
number of observations and with each diagonal element containing an
observation's weight. The X and Y matrices are defined in Section II of the
report with one exception: there may be more than one column allocated for an
effect (which may be a classificatory variable). Interactions among continuous
and classificatory variables are permitted.
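The accumulation described above can be sketched as a single pass over the observations in which only the p×p and p×1 running sums are kept. This is a minimal Python illustration, not the SURREGR implementation; the function name `accumulate` and the sample records are hypothetical.

```python
def accumulate(observations, p):
    """Accumulate X'WX (p x p) and X'WY (length p) one observation at a time.

    observations yields (x, y, w): x is a list of p regressor values, y the
    response, and w the observation weight. X, Y, and W are never stored.
    """
    xwx = [[0.0] * p for _ in range(p)]
    xwy = [0.0] * p
    for x, y, w in observations:
        for i in range(p):
            xwy[i] += w * x[i] * y
            for k in range(i + 1):      # lower triangle only; X'WX is symmetric
                xwx[i][k] += w * x[i] * x[k]
    for i in range(p):                  # mirror to the upper triangle
        for k in range(i):
            xwx[k][i] = xwx[i][k]
    return xwx, xwy

# Hypothetical records: intercept column plus one regressor
obs = [([1.0, 2.0], 3.0, 2.0), ([1.0, 4.0], 5.0, 4.0), ([1.0, 6.0], 8.0, 4.0)]
xwx, xwy = accumulate(obs, 2)
```

Only the lower triangle is updated inside the loop, which roughly halves the work per observation.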

C. A Solution for the Regression Coefficients

To compute such a solution, the inverse of X'WX must be found. The
Cholesky decomposition technique described by Wilkinson and Reinsch (1971) is
used to compute a standard matrix inverse unless X'WX is singular. In this
case, a generalized inverse is computed. This inverse, $A^{-1}$, for a matrix, A,
must satisfy the following conditions:

$$A = A A^{-1} A \tag{C.2}$$

and

$$A^{-1} = A^{-1} A A^{-1}. \tag{C.3}$$

A check for the numerical accuracy of equations (C.2) and (C.3) is provided
since some ill-conditioned matrices may be subject to large numerical errors.
Each term on the right-hand side of the equations is evaluated and compared
with the corresponding element on the left-hand side. The maximum difference
found between any two elements of either comparison is reported to the user.
If any deviation exceeds a set tolerance, the user is given a warning message;
however, the program will continue. Subsequently, the regression coefficients
are computed by the formula in (A.14) given in Appendix A.
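The accuracy check on (C.2) and (C.3) can be sketched as follows. This is an illustrative Python version that assumes an absolute element-wise comparison and a default tolerance of 1e-8; the function names and the example matrices are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def max_abs_diff(A, B):
    """Largest absolute element-wise difference between A and B."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def check_g_inverse(A, G, tol=1e-8):
    """Return the maximum deviations for (C.2) and (C.3) and a warning flag."""
    dev_c2 = max_abs_diff(A, matmul(matmul(A, G), A))   # A versus A G A
    dev_c3 = max_abs_diff(G, matmul(matmul(G, A), G))   # G versus G A G
    return dev_c2, dev_c3, max(dev_c2, dev_c3) > tol

# Hypothetical singular matrix and one generalized inverse of it
A = [[1.0, 1.0], [1.0, 1.0]]
G = [[1.0, 0.0], [0.0, 0.0]]
dev_c2, dev_c3, warn = check_g_inverse(A, G)
```

For this pair both deviations are zero and no warning is raised; passing the identity matrix as `G` instead violates both conditions and sets the warning flag.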


D. General Mean Square Errors

This computation requires that the file be reread and that the Taylorized
deviations defined in Appendix A, equation (A.38), be computed. These deviations
may be rewritten in the notation of this appendix as

$$r(h\ell j) = [X'(h\ell j)\{Y(h\ell j) - X(h\ell j)\,B\}]\,W(h\ell j), \tag{C.4}$$

where r(hℓj) is a column vector and W(hℓj) = 1/P(hℓj); other notations are
defined in Appendix A and are not repeated here. The following sums and sums
of squares and cross products are computed:

$$r(h\ell) = \sum_{j=1}^{n(h\ell)} r(h\ell j), \tag{C.5}$$

$$rr'(h) = \sum_{\ell=1}^{n(h)} r(h\ell)\,r'(h\ell), \tag{C.6}$$

$$r(h) = \sum_{\ell=1}^{n(h)} r(h\ell). \tag{C.7}$$

The variance-covariance matrix is then accumulated over strata and adjusted by
(X'WX)⁻¹ following these accumulations to yield the variance-covariance matrix
$S_B^2$. Specifically,

$$S_B^2 = (X'WX)^{-1} \left[ \sum_{h=1}^{H} \{n(h)\,rr'(h) - r(h)\,r'(h)\}/\{n(h)-1\} \right] (X'WX)^{-1}. \tag{C.8}$$
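The two-pass computation in (C.4) through (C.8) can be sketched in Python for a model with an intercept and one regressor (p = 2). This is a simplified illustration under stated assumptions: every stratum has at least two PSUs, X'WX is nonsingular, and all names and data values are hypothetical rather than taken from SURREGR.

```python
from collections import defaultdict

def inv2(A):
    """Inverse of a nonsingular 2 x 2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det], [-A[1][0] / det, A[0][0] / det]]

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def taylor_variance(records):
    """records: (stratum, psu, x, y, w) with x = [1, x1]; returns (B, S)."""
    p = 2
    # First pass: X'WX, X'WY, and the coefficients B
    xwx = [[sum(w * x[i] * x[k] for _, _, x, _, w in records) for k in range(p)]
           for i in range(p)]
    xwy = [sum(w * x[i] * y for _, _, x, y, w in records) for i in range(p)]
    xwx_inv = inv2(xwx)
    B = [sum(xwx_inv[i][k] * xwy[k] for k in range(p)) for i in range(p)]

    # Second pass, (C.4) and (C.5): PSU totals r(hl) of Taylorized deviations
    r_psu = defaultdict(lambda: [0.0] * p)
    for h, l, x, y, w in records:
        resid = y - sum(x[i] * B[i] for i in range(p))
        for i in range(p):
            r_psu[(h, l)][i] += x[i] * resid * w

    # (C.6) and (C.7): per-stratum cross products rr'(h) and totals r(h)
    strata = defaultdict(list)
    for (h, l), r in r_psu.items():
        strata[h].append(r)
    middle = [[0.0] * p for _ in range(p)]
    for psu_totals in strata.values():
        n_h = len(psu_totals)           # assumed to be at least 2
        r_h = [sum(r[i] for r in psu_totals) for i in range(p)]
        for i in range(p):
            for k in range(p):
                rr = sum(r[i] * r[k] for r in psu_totals)
                middle[i][k] += (n_h * rr - r_h[i] * r_h[k]) / (n_h - 1)

    # (C.8): pre- and post-multiply the accumulated matrix by (X'WX)^{-1}
    S = matmul(matmul(xwx_inv, middle), xwx_inv)
    return B, S

# Hypothetical data: 2 strata, 2 PSUs each, 2 observations per PSU
data = [
    (1, 1, [1.0, 1.0], 2.1, 2.0), (1, 1, [1.0, 2.0], 3.9, 2.0),
    (1, 2, [1.0, 3.0], 6.2, 2.0), (1, 2, [1.0, 4.0], 7.8, 2.0),
    (2, 1, [1.0, 1.5], 3.2, 4.0), (2, 1, [1.0, 2.5], 4.9, 4.0),
    (2, 2, [1.0, 3.5], 7.1, 4.0), (2, 2, [1.0, 4.5], 9.2, 4.0),
]
B, S = taylor_variance(data)
```

The returned matrix `S` is symmetric with nonnegative diagonal elements, since the bracketed term in (C.8) is a sum of between-PSU cross-product matrices.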



E. Tests of Hypothesis

The last major function of the program is to compute the tests of
hypothesis, first for the entire model and then for each effect. The null
hypothesis for any of these tests may be written as

$$H_0: B_m = B_{m+1} = \cdots = B_n = 0 \quad (\text{for } n > m), \tag{C.9}$$

against the alternative hypothesis

$$H_1: B_k \neq 0 \quad (\text{for some } k,\ m \le k \le n). \tag{C.10}$$

For a particular hypothesis, the program determines its rank, d, and a d×p
matrix, C, such that the given hypothesis is in the form CB = 0. The value of
d is n-m+1 if all of the parameters, $B_m, B_{m+1}, \ldots, B_n$, are estimable.
Otherwise, the value of d is less than n-m+1.

If the parameters were normally distributed, the test statistic would be
a likelihood ratio criterion which would have an approximate F distribution
for large samples. If $S_B^2$ is the variance-covariance matrix of B with degrees
of freedom, e, equal to the number of PSUs minus the number of strata, then
the test statistic from Folsom (1974),

$$F_{d,e} = (CB)'\,(C\,S_B^2\,C')^{-1}\,(CB), \tag{C.11}$$

is an approximate F with d and e degrees of freedom under the null hypothesis.

III. DESIGN FEATURES

The SURREGR procedure is designed to produce a regression analysis for
sample survey data. To achieve this end, a number of unique features have
been incorporated into the program. The following attributes place SURREGR in
a class apart from the standard regression packages of BMDP, SPSS, and SAS:

a) SURREGR accounts for the correlation between observations due to the
   sample design.
b) There is no program limit to the number of models which may be
   specified in one procedure.
c) Effects and interactions are allowable in independent and dependent
   variables.
d) Standard tests of hypotheses are provided, and in the case of a
   non-full-rank hypothesis, a test of the estimable subhypotheses is
   made.
e) Checks are made to establish the condition of (X'WX)⁻¹.
f) SURREGR has the ability to select multiple random samples from a
   data file which is considered to be a finite population. This
   permits empirical evaluation of the performance of the statistics
   and tests generated by the program.



Appendix D

User's Manual for the SURREGR Procedure

SURREGR is a procedure which provides a means of producing appropriate
tests of hypotheses for regression models in sample survey situations. The
procedure offers many useful options and operates in three modes, which
differ only in the method by which the variance-covariance matrix of the
regression coefficients is calculated. SURREGR was developed principally
to handle regression analysis for sample survey data; hence, the default
mode of the procedure will incorporate a stratified multistage sampling
design into the variance-covariance computation. Another mode of the
procedure relies on the ordinary least squares estimate for the
variance-covariance matrix. Finally, a weight may be used for a weighted
ordinary least squares analysis.

The Procedure SURREGR Statement 

PROC SURREGR options and parameters; 

The options and parameters for the PROC SURREGR statement are grouped
by function.

FILE OUTPUT

DATAOUT (abbreviated DOUT)

This option produces a SAS file which contains for each model the
regression coefficients, the variance-covariance matrix, the F test values,
and their associated degrees of freedom. A new record is generated for
each different value of a dependent effect. The data record structure and
output variable names and descriptions follow:

MODEL      Model number
DVAR       Dependent variable number
NCELL      Number of columns of the X matrix
NTESTS     Number of F test values
CHECK      This variable equals zero if the X'X inverse matrix is
           acceptable.
B001-B___  The regression coefficients (beta values). Each variable
           name for a regression coefficient starts with a B and ends
           with a three-digit number with leading zeros, which is the
           column of the X matrix to which the regression coefficient
           corresponds. B001, for example, represents the intercept
           value if an intercept was included in the model.
V001-V___  The variance-covariance matrix of the regression
           coefficients. The matrix is output in lower triangular form by
           rows. The variable name starts with a V and ends with a
           three-digit number with leading zeros, which is the position
           of the variable in the lower triangular matrix.
F001-F___  The F test values from the tests of hypothesis for the
           entire model (F001) and for each independent effect
           (F002-F___).
D001-D___  The degrees of freedom associated with each F test value.

It is important to realize that the specification of the output data
set cannot be made with the standard SAS two-level format. A separate
parameter is needed for each level.
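To locate a particular variance or covariance among the V variables, the row-by-row lower triangular layout can be mapped to a position number. The sketch below assumes the usual packing in which element (i, j) with j ≤ i occupies position i(i-1)/2 + j; that formula and the helper name `v_name` are assumptions made for illustration and are not stated in the manual.

```python
def v_name(i, j):
    """Name of the output variable for the (i, j) covariance, 1-based indices.

    Assumes the lower triangle is stored row by row, so element (i, j) with
    j <= i sits at position i*(i-1)/2 + j.
    """
    if j > i:               # the matrix is symmetric; use the lower triangle
        i, j = j, i
    position = i * (i - 1) // 2 + j
    return f"V{position:03d}"

# Names for a model with three columns in the X matrix
names = [v_name(i, j) for i in range(1, 4) for j in range(1, i + 1)]
```

Under this assumption, the variance of the third coefficient in a three-column model would be found in V006.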

DDNAME=___ (abbreviated DDN)
           This parameter is used with the DATAOUT option to specify
           the DDNAME in a JCL statement which describes the OS data
           set for the output file. If DDNAME is omitted, a temporary
           file will be used.

DSNAME=___ (abbreviated DSN)
           DSNAME is used with the DATAOUT option. It is a six-character
           name for the output data set. If DSNAME is omitted, the
           name DUMMYM will be generated by the procedure. Since
           each different model produces a different data set, a
           two-character suffix to the six-character data set name is added
           by the procedure to identify the model number. These two
           characters range from 01 through the number of models.

RESIDUAL

This option allows for output to an SAS data set of the unweighted
predicted and residual values associated with each level of each dependent
effect for each model. There is one output record for each observation in
the input file. The output variables are:

M01PRD01-M01PRD__  The predicted values. The variable name begins with
           the letter M, followed by a two-digit number (the model
           number), the letters PRD, and a two-digit number with leading
           zeros for the dependent effect level. For example, the
           predicted value for the second continuous dependent variable
           in the fourth model is M04PRD02.

M01RSD01-M01RSD__  The residuals.
           (Description the same as for predicted values.)

OUT=___    This parameter is associated with the RESIDUAL option. It
           provides the procedure with a standard one- or two-level SAS
           data set name. If it is omitted, the next WORK data set will
           be used.

PRINTED OUTPUT

The hypothesis testing results and the checks on the inverse of X'X
are printed by default.

NOPRINT    This option suppresses all printed output.
BETA       BETA prints a solution to the normal equations and the
           variance-covariance matrix for that solution. It should be
           noted that singularities in the X matrix produce corresponding
           zeros in the regression coefficients and in the
           variance-covariance matrix. There is no reparameterization.
XPX        This option prints the X'X matrix and its inverse.

MODES OF OPERATION

Computation of the variance-covariance matrix using Taylorized
deviations and a sampling structure is the default.

OLS        OLS requests ordinary least squares analysis.
WLS        WLS requests weighted ordinary least squares analysis.
TAYWLS     TAYWLS will compute WLS and then will repeat the analysis
           using the sampling structure and Taylorized deviations.

FILE INPUT

DATA=___   This parameter specifies a standard one- or two-level SAS
           data set name to be used by the procedure as the input data.
           If DATA is omitted, the current SAS data set will be used.

PROGRAM CONTROL

MISSPSU    This option is for Taylorized deviations and is only needed
           when no more than one PSU (primary sampling unit) in a
           stratum has valid data. A divisor used in computing the
           variance-covariance matrix of the regression coefficients is
           corrected from the number of PSUs in a stratum with valid
           data minus 1 to the total number of PSUs in a stratum minus 1.

TOL=___    The absolute tolerance used to compute all relative tolerances
           in the program is set at 10⁻⁸ unless this parameter is
           assigned a different value.

PLACES=___ The number of digits used for all matrix printing is set
           to 8 unless this parameter is assigned a different value.

PROCEDURE INFORMATION STATEMENTS

Model Statement

MODEL dependent effects = independent effects / list of options;

The MODEL statements allow the user to list one or multiple dependent
effects with any number of independent effects. An effect may be a single
variable or a main effect, or it may be composed of a group of variables.
When there is more than one variable in an effect, each variable must be
joined to the next with either * indicating crossed variables or a ( )
indicating a nesting structure. An effect may contain continuous or discrete
variables, but only discrete variables may be nested. Variables which are
combined into one effect must be listed with the crossed and then the
nested groupings. Only one level of nesting is allowed.

Examples of correctly formed effects:

A*B        A crossed with B.
A(B)       A nested within B.
A*B(C)     A crossed with C nested within B.
A(B C)     A nested within B crossed with C.

Note that an * is not to be used before the ( or between B and C.

Examples of incorrectly formed effects:

A*(B)      The * is not allowed.
A(B)*C     Crossing must be specified before nesting.
X1-X10 or A*(X1-X5)
           All variables must be individually listed. The "-" option
           of SAS is not valid in the MODEL statement.

NOINT      Only one option is available for a particular MODEL
           statement. Unless NOINT is specified, SURREGR will assume
           an intercept for the model.

CLASSES STATEMENT (abbreviated CLASS)

CLASSES list of variables; In order for a variable to be treated as
discrete, it must be in the CLASS statement. CLASSES A1-A4 is a valid
CLASS statement.

PSU STATEMENT

PSU variable name; PSU gives the name of the variable containing a
numerical primary sampling unit indicator.

STRATUM STATEMENT (abbreviated STR)

STRATUM variable name; STRATUM gives the name of the variable containing
a numerical representation for each stratum. Remember that the data set
must be sorted by PSU within stratum for a Taylorized deviation
computation of the variance-covariance matrix.

WEIGHT STATEMENT (abbreviated WT)

WEIGHT variable name; WEIGHT gives the name of the sampling weight
variable.

LEVELS STATEMENT

LEVELS list of numbers separated by blanks; A level is the number of
values available for a particular discrete variable. That variable must
be coded from 1 through the maximum value available. There must be a
level specified for each variable listed in the CLASSES statement, and
the levels must be ordered exactly as the variables in the CLASSES
statement. If there are variables in the CLASSES statement which all have
the same number of levels, then the notation can be shortened. Four
consecutive variables with two levels each may be written as:

LEVELS 4*2; or LEVELS 2 2 2 2;
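The shortened LEVELS notation can be read as a repeat count followed by the number of levels. The following Python sketch expands such a list; it illustrates the notation only, and both the helper name `expand_levels` and the mixing of shortened and plain entries in one list are assumptions.

```python
def expand_levels(spec):
    """Expand a LEVELS list such as '4*2 3' into [2, 2, 2, 2, 3]."""
    levels = []
    for token in spec.split():
        if "*" in token:
            # 'count*value' stands for `count` consecutive variables
            count, value = token.split("*")
            levels.extend([int(value)] * int(count))
        else:
            levels.append(int(token))
    return levels
```

With this reading, `expand_levels("4*2")` and `expand_levels("2 2 2 2")` give the same list of four two-level variables.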



Certain modes of SURREGR require different procedure information
statements:

Statement   Taylorized    OLS           WLS           TAYWLS
            Deviations
MODEL       required      required      required      required
CLASSES     optional      optional      optional      optional
PSU         required      irrelevant    irrelevant    required
STRATUM     required      irrelevant    irrelevant    required
WEIGHT      required      not allowed   required      required
LEVELS      required with the CLASSES statement (all modes)

COMPUTATIONAL METHODS AND NOTES

The X Matrix

The X matrix is a matrix with a row for each observation. The number
of columns is the sum of the number of locations needed to hold each effect
plus one column for the intercept if necessary. A continuous main effect
or an effect with continuous variables crossed together requires only one
column. A discrete main effect requires columns equal to the number of
levels for that variable. When discrete variables are crossed or nested,
the number of columns is equal to the product of the levels for each variable.
The values of the effect are located within the program as well as for
output by varying the value of the last variable most rapidly. If an
effect is defined as A*B*C where A has a levels, B has b levels, and C has
c levels, then the actual location of an observation A=x, B=y, and C=z
within the a*b*c available locations is (x-1)*b*c + (y-1)*c + z. Note that
X'X is accumulated once for all dependent variables in a model. In order
to have different treatments for different dependent variables, a separate
MODEL statement must be used for each dependent variable.
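The location rule for crossed discrete variables generalizes to any number of variables when the last variable varies most rapidly. A minimal Python sketch follows; the function name `cell_location` is hypothetical.

```python
def cell_location(values, levels):
    """1-based location of a cell within a crossed discrete effect.

    values: observed codes (x, y, z, ...), each coded 1 through its level.
    levels: number of levels (a, b, c, ...) in the same order.
    The last variable varies most rapidly.
    """
    location = 0
    for value, n_levels in zip(values, levels):
        if not 1 <= value <= n_levels:
            raise ValueError("codes must run from 1 through the number of levels")
        location = location * n_levels + (value - 1)
    return location + 1
```

For three variables this reduces to (x-1)*b*c + (y-1)*c + z, so with levels (2, 3, 4) the cell A=2, B=1, C=3 falls in location 15 of the 24 available.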

CHECKS ON THE INVERSE

The sums of squares and cross products matrix for the independent
effects in a model statement, X'X, is inverted as part of the least
squares procedure. The inverse, (X'X)⁻¹, is a generalized inverse. There
is a check provided on the condition of the inverse. Each element of X'X
is compared with that of X'X (X'X)⁻¹ X'X, and then each element of (X'X)⁻¹
is compared with that of (X'X)⁻¹ X'X (X'X)⁻¹. A relative deviation equal
to (the check value minus the actual value) divided by the actual value is
compared with the program's set tolerance times 100. If any deviation
exceeds the set tolerance, the user is given a warning message.

THE VARIANCE-COVARIANCE MATRIX OF THE REGRESSION COEFFICIENTS

If no option relating to the variance-covariance matrix is specified,
a between-PSU (primary sampling unit), within-stratum, generalized mean
square error (GMSE) is computed. This GMSE is derived for the regression
problem using the technique of Taylorized linearization, yielding a Taylorized
deviation which is incorporated in the computations.

For the OLS option, the variance-covariance matrix is (X'X)⁻¹σ², with

$$\sigma^2 = (Y'Y - b'X'Y)/(N - r),$$

where Y is a vector of all observations for one dependent effect, b is a
vector of regression coefficients for that dependent effect, N is the
number of observations, and r is the rank of X.

For the WLS option, the variance-covariance matrix has the same formula
as for OLS except that each product of dependent and independent effects
observation has been multiplied once by that observation's weight.
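The OLS formula can be illustrated with the simplest possible model, a single regressor and no intercept, so that X'X, X'Y, and Y'Y are all scalars and the rank of X is 1. The Python sketch below is a worked example under those assumptions; the function name and data are hypothetical.

```python
def ols_variance_1d(xs, ys):
    """OLS coefficient through the origin and its variance for one regressor."""
    n = len(xs)
    xx = sum(x * x for x in xs)                 # X'X (scalar)
    xy = sum(x * y for x, y in zip(xs, ys))     # X'Y (scalar)
    yy = sum(y * y for y in ys)                 # Y'Y
    b = xy / xx
    rank = 1                                    # rank of X when xx > 0
    sigma2 = (yy - b * xy) / (n - rank)         # (Y'Y - b'X'Y) / (N - r)
    return b, sigma2 / xx                       # (X'X)^{-1} times sigma^2

b, var_b = ols_variance_1d([1.0, 2.0, 3.0], [2.0, 4.0, 7.0])
```

Here X'X = 14, X'Y = 31, and Y'Y = 69, so b = 31/14 and the residual mean square is (69 - 31²/14)/2.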

HYPOTHESIS TESTING

The last major function of the program is to compute the tests of
hypothesis, first for the entire model and then for each effect. These
tests exclude the intercept. The null hypothesis for any of these tests
may be written as

$$H_0: B_m = B_{m+1} = \cdots = B_n = 0 \quad (\text{for } n > m),$$

against the alternative hypothesis

$$H_1: B_k \neq 0 \quad (\text{for some } k,\ m \le k \le n).$$

For a particular hypothesis, the program determines the rank, d, of the
estimable subspace of the hypothesis and a d×p matrix, C, such that the
parameters CB are estimable and rank C is d. The value of d is n - m + 1
if all of the parameters, $B_m, B_{m+1}, \ldots, B_n$, are estimable. Otherwise,
the value of d is less than n - m + 1.

If the parameters were normally distributed, the test statistic would
be a likelihood ratio criterion which would have an approximate F
distribution for large samples. If $S_B^2$ is the variance-covariance matrix of B
with degrees of freedom, e, equal to the number of PSUs minus the number of
strata minus the rank of X'X, then the test statistic,

$$F_{d,e} = (CB)'\,(C\,S_B^2\,C')^{-1}\,(CB),$$

is an approximate F with d and e degrees of freedom under the null hypothesis.
For OLS, e is equal to the number of observations minus the rank of X'X.
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